Thursday, November 18, 2010

Reading Excel files using JRuby and Apache POI

In the code below, I am reading Microsoft Excel files using JRuby and the Apache POI library.  You will need the following, which you can get from http://apache.org/:


  • commons-logging-1.1.jar
  • dom4j-1.6.1.jar
  • geronimo-stax-api_1.0_spec-1.0.jar
  • junit-3.8.1.jar
  • log4j-1.2.13.jar
  • poi-3.6-20091214.jar
  • poi-contrib-3.6-20091214.jar
  • poi-examples-3.6-20091214.jar
  • poi-ooxml-3.6-20091214.jar
  • poi-ooxml-schemas-3.6-20091214.jar
  • poi-scratchpad-3.6-20091214.jar
  • xmlbeans-2.3.0.jar

I don't remember which ones I used, but those are the ones which were in the library when it worked.  According to the code below, you will want to put those in "lib/java_classes." However, you can put them wherever.

The code below outputs the Excel spreadsheet as an array:

Dir.new("lib/java_classes").sort.each do | file |
  require("lib/java_classes/" + file) if file =~ /\.jar$/
end

include_class Java::org.apache.poi.poifs.filesystem.POIFSFileSystem
include_class Java::org.apache.poi.poifs.filesystem.POIFSFileSystem

include_class Java::java.io.ByteArrayOutputStream
include_class Java::java.util.Date


require 'yaml'

args = YAML::load(STDIN)

input = Java::JavaIo::FileInputStream.new(args["file"])

include_class Java::org.apache.poi.ss.usermodel.WorkbookFactory
include_class Java::org.apache.poi.ss.usermodel.Workbook
include_class Java::org.apache.poi.ss.usermodel.Sheet
include_class Java::org.apache.poi.ss.usermodel.Row
include_class Java::org.apache.poi.ss.usermodel.Cell
include_class Java::org.apache.poi.ss.usermodel.FormulaEvaluator

@spreadsheet = WorkbookFactory.create(input)
@formula_evaluator = @spreadsheet.get_creation_helper.create_formula_evaluator


def parse_value(cell)
  return nil if cell.nil?

  case cell.get_cell_type
  when Cell.CELL_TYPE_NUMERIC then cell.get_numeric_cell_value
  when Cell.CELL_TYPE_STRING then cell.get_string_cell_value
  when Cell.CELL_TYPE_FORMULA then parse_formula(@formula_evaluator.evaluate(cell))
  when Cell.CELL_TYPE_BOOLEAN then cell.get_boolean_cell_value
  when Cell.CELL_TYPE_BLANK then nil
  else
    raise "Don't know what to do with cell type: #{cell.get_cell_type}"
  end
end

def parse_formula(cell_value)
  case cell_value.get_cell_type
  when Cell.CELL_TYPE_NUMERIC then cell_value.get_number_value
  when Cell.CELL_TYPE_STRING then cell_value.get_string_value
  when Cell.CELL_TYPE_BOOLEAN then cell_value.get_boolean_value
  when Cell.CELL_TYPE_BLANK then nil
  else
    raise "Don't know what to do with cell type: #{cell_value.get_cell_type}"
  end
end

spreadsheet_as_array = ([email protected]_number_of_sheets - 1).map do | sheet_num |
  sheet = @spreadsheet.sheet_at(sheet_num)
  sheet_as_array = [sheet.sheet_name, []]

  sheet.row_iterator.each do | row |
    sheet_as_array[1] << []

    row.cell_iterator.each do | cell |
      sheet_as_array[1].last << parse_value(cell)
    end
  end

  sheet_as_array
end

puts spreadsheet_as_array.inspect

There is a flaw to this code above: an Excel sheet needs to be "cleaned" prior to importing.  Excel files typically get tons of extra rows and tons of extra columns associated with the Active Cells.  People apply formatting to unused areas by selecting all, etc.  Therefore, go in and delete unused cells, and do "Save As" to get Excel to forget formatting on those cells.

To run the code above, do the following:

echo 'file: my_spreadsheet.xls' | jruby xls_to_array.rb

Statistically Improbable Cheating

I recently ran across an article on finding test cheaters with statistics(http://www.telegraph.co.uk/news/newsvideo/weirdnewsvideo/8140456/200-students-admit-cheating-after-professors-online-rant.html), and I remembered an event I had when I was a teacher.  Then, I recovered this from a message on 03/16/2005 when I was a teacher:

I had my students a test recently that contained 65 multiple choice questions. Also, I had a student tell me  my class was easy, so I considered that a challenge to whoop the students with the next test.
So students are taking the test (which I give on computers, and it grades automagically) and the grades are coming back between 40's and 65's.  Then some more classes come in and they start making between 40's and 80's.  So the final average average is about 55.  There is a good equal distribution so I can work with it.


I start figuring out how I am going to scale the test (and in public schools they frown on true Bell Curves.  They are really looking for a line with about 5% failing and 30% A's. ). So I do some computer stuff and figure that I will split the hardest questions from the easiest on a scale with 5 steps.  The hardest are worth 1.8 and the easiest are worth 2.2.  That gives me a good line, and theoretically removes any bonuses for guessing correctly.


So, I get this knocked out. Finally I decided to write a little script to see if anyone is cheating.  I examine everyone's answers, and the odds of two people having 60 of 65 of the same answers on the test is highly unlikely. Seeing how most only got 55% correct, correlations between two different test-takers should not be there.


After I run this script, I get 4 sets of students that have 60 similar answers. 3 of the 4 sets sit right next to each other.  1 of the 4 sets has one student that sits directly behind the other.  There are no other correlations.


Convenient huh?


The odds of two students out 100 that are in of five classes sitting directly next to each other randomly putting the same answers for 60 of 65 four choice multiple questions the exact same is .00064%.


I think it was just brain waves interfering.

Wednesday, November 17, 2010

Using Math with Ruby Ranges

Back in August, I wrote a little Ruby library that extends math functions to ranges (https://github.com/Winslett/RubyRanges). The sole purpose of the library was to easily work with schedules and conflicts.

For instance, given I am available from 2 to 6, yet have scheduled events at 3 to 4 and 4:30 to 5:15, I could use simple math functions like the following:

  (2010-11-17 14:00:00..2010-11-17 18:00:00) - (2010-11-17 15:00:00..2010-11-17 16:00:00) - (2010-11-17 16:30:00..2010-11-17 17:15:00)

Should equal an array of ranges of times the person is available:

  [(2010-11-17 14:00:00..2010-11-17 15:00:00),(2010-11-17 16:00:00..2010-11-17 16:30:00),(2010-11-17 17:15:00..2010-11-17 18:00:00)]

Please take the library. I look to convert it into a Gem soon, but job comes first.

Chris

Tuesday, November 16, 2010

Automating Microsoft Word with IronRuby

Below some snippets with details of a project I worked on for documentation creation workflow.   A core focus of the project was reduce the number of steps needed by a person to automate a document creation and distribution process.

At the basis of the organizational knowledge is Microsoft Word and Excel, therefore, we stuck with Microsoft Word and Excel.  Given I introduced more base technology, I would have to support more base technology.  Therefore, I tried to minimize change and maximize return-on-technology.

Each document was a mail merge file in Microsoft Word.  Each mail merged document had multiple receipts, and the routing and packaging of documents could be either physical or E-mail.  The physical distribution had a variable number of copies based on receipts, and the E-mail distribution had a single copy with a variable number of recipients.

Now for the code:

Below loads the information into IronRuby:
require 'Microsoft.Office.Interop.Word, Version=12.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c'

Below I abstracted the Microsoft Word interface:
class Word
  attr_accessor :word, :documents

  def self.connect
    word = Word.new
    word.word = System::Runtime::InteropServices::Marshal.get_active_object("Word.Application")
    word.documents = (1..word.word.Documents.Count).inject([]) do | documents, num |
    documents && Document.new(:document => word.word.Documents[num])
    end
    word
  end

  def self.open
    word = Word.new
    word.word = Microsoft::Office::Interop::Word::ApplicationClass.new
    word.visible
    word.documents = []
    word
  end

  def visible
    word.visible = true
  end

  def close(args = {})
    args[:force] ||= false
    word.Quit(args[:force])
  end

  def printers
  System::Drawing::Printing::PrinterSettings.InstalledPrinters.map { | p | p }
  end

  def active_printer
  word.ActivePrinter
  end

  def active_printer=(arg)
  word.ActivePrinter = arg
  end

  def disable_alerts
  word.DisplayAlerts = Microsoft::Office::Interop::Word::WdAlertLevel::wdAlertsNone
  end

  def enable_alerts
  word.DisplayAlerts = Microsoft::Office::Interop::Word::WdAlertLevel::wdAlertsAll
  end
  
end

Below is the Document object. While working, I originally started with Document controlling the Word and Document objects, but eventually realized I needed to split the two:

class Document
  attr_accessor :document

  def initialize(args = {})
   self.document = args[:document]
  end

  def name
    document.Name
  end

  def ToString
    name
  end

  def self.find(*args)
    @@word ||= Word.connect

    case
    when args[0].is_a?(Fixnum) then
      @@word.documents[args[0]] || raise(WordErrors::DocumentNotFound, args.inspect)
    when args[0].is_a?(String) then
      @@word.documents.find { | document | document.name == args[0] } || raise(WordErrors::DocumentNotFound, args.inspect)
    when args[0] == :all then
      @@word.documents
    else
      raise(WordErrors::DocumentNotFound, args.inspect)
    end
  end

  def self.open(file_path = nil)
    @@word ||= Word.open

    document = Document.new(
      :document => !file_path.nil? ?
      @@word.word.Documents.Open(System::String.new(file_path)) : @@word.word.Documents.Add
    )

    @@word.documents << document
    document
  end

  def mailmerge?
    document.MailMerge.State != 0
  end

  def close(args)
    save_changes = args[:force] ? false : true
    @@word.documents.delete(self)
    document.Close(save_changes)

    if @@word.documents.length == 0
      @@word.close(:force => true)
      @@word = nil
    end
  end

  #
  # record => Fixnum, :first, :last, :previous, :next, [Fixnums]
  #
  def goto(record)
    return true if record == :current || current_record == record

    case
    when record.is_a?(Fixnum)
      document.MailMerge.DataSource.ActiveRecord = record
    when record == :last
      document.MailMerge.DataSource.ActiveRecord = record_count
    when record == :first
      document.MailMerge.DataSource.ActiveRecord = 1
    when record == :next
      document.MailMerge.DataSource.ActiveRecord = current_record + 1
    when record == :previous
      document.MailMerge.DataSource.ActiveRecord = current_record - 1
    else
      raise(WordErrors::RecordNotFound, "Cannot use '#{record}' to find a record")
    end
  rescue
    raise WordErrors::NotMailMerge unless mailmerge?
  end

  #
  # Expects the following variables to be passed:
  #
  # action => [:print, :save]
  #
  # args =>
  #
  #   :records           => [:all, :current, :next, :previous, :first, :last]
  #   :path              => path to the directory which will save the files
  #   :directory_formula => The formula for the directory based on the
  #                         Investors datasource values '#{Inv} - #{Investor}'
  #   :name              => The file name you wish to give the document
  #
  #
  def export(action, args)
    set_action(action)

    if args[:records] == :all
      1.upto(record_count) do | record |
        export(action, args.merge(:records => record))
        @@word.word.ActiveDocument.Close(0)
      end
    else # For all other possibilities
      goto(args[:records])
      export_current
      save(args) if action == :save
      Document.new(:document => @@word.word.ActiveDocument)
    end
  end

  def save(args)
    location = create_directory(args[:path], args[:directory_formula])
    
    url = System::String.new(File.windows_join(location, args[:name]))

    puts "Saving: #{url}"

    case
    when args[:name].downcase =~ /\.pdf$/ then save_as_pdf(url)
    when args[:name].downcase =~ /\.doc$/ then save_as_doc(url)
    else
      raise "Don't know how to save: #{location}"
    end
  rescue Exception => e
    raise WordErrors::CouldNotSave, "Error saving to #{location}\\#{args[:name]}", [e.to_s] + e.backtrace
  end

  #
  # Holder for an array of all the data fields in the DataSource
  #
  def data_fields
    @data_fields ||= return_data_fields
  end

  #
  # Returns the number of records in a DataSource
  #
  def record_count
    @document.MailMerge.DataSource.RecordCount
  end

  #
  # Returns the Row Value for a Particular Column of the
  # DataSource
  def field_value(index)
    index = data_fields.index(index) + 1 if index.is_a?(String)
    @document.MailMerge.DataSource.DataFields(index).Value
  rescue
    raise WordErrors::NotMailMerge unless mailmerge?
  end


  #
  # Pass in a formula for fields and it finds and replaces the fields
  #
  def replace_fields_with_values(formula)
    while formula =~ /\#\{([^\}]+)\}/ do
      replaced_string = $&
      escaped_string  = field_value($1)
      formula.gsub!(replaced_string, escaped_string)
    end
    formula
  end

  def save_as_pdf(url)
    @@word.word.ActiveDocument.ExportAsFixedFormat(url, Microsoft::Office::Interop::Word::WdExportFormat::wdExportFormatPDF)
  end

  def print(args = {})
  copies = args[:copies] ? args[:copies].to_i : 1
  document.PrintOut(false)
  end

  private
  #
  # Creates the Directory to the current record based on the
  # path created by the directory formula
  #
  def create_directory(path, dir_formula)
    dir_name = directory_name_from_formula(dir_formula)
    dir_path = File.windows_join(path, dir_name)

    FileUtils.mkdir_p(dir_path) unless File.exists?(dir_path)

    dir_path
  end

  #
  # actions = [:save, :print]
  #
  def set_action(action)
    document.MailMerge.Destination = case

    when action == :print then 1
    when action == :save then 0
    else
      raise WordErrors::InvalidAction, action
    end
  end

  #
  # Performs the current action as setup by other variables.
  # Including set_action.  This action will only be used with
  # the :save action, since print automatically prints records
  #
  def export_current
    document.MailMerge.DataSource.FirstRecord = current_record
    document.MailMerge.DataSource.LastRecord = current_record
    document.MailMerge.Execute(false) # false is whether to "Pause"
  end

  #
  # Generates a name for the directory based on the values from
  # the record and a formula given by the user
  #
  def directory_name_from_formula(dir_formula)
    dir_name = dir_formula.clone
    replace_fields_with_values(dir_name).gsub(/[^A-Za-z0-9\ \-\+\.\,]+/, "-")[0..64].strip
  end

  #
  # Returns an array of all the data fields in the DataSource
  def return_data_fields
    @data_fields = []
    1.upto(@document.MailMerge.DataSource.DataFields.Count) do | i |
      @data_fields << @document.MailMerge.DataSource.DataFields.Item(i).Name.strip
    end
    @data_fields
  end

  #
  # Returns the numeric value of the ActiveRecord
  #
  def current_record
    document.MailMerge.DataSource.ActiveRecord
  end

  def save_as_doc(url)
    @@word.word.ActiveDocument.SaveAs(url)
  end

end

Working to scrub the data of any domain specific content, then I'll post the full code. If you have any questions, please E-mail me, and I'll answer them.

Chris