Thursday, November 18, 2010

Reading Excel files using JRuby and Apache POI

In the code below, I am reading Microsoft Excel files using JRuby and the Apache POI library.  You will need the following, which you can get from http://apache.org/:


  • commons-logging-1.1.jar
  • dom4j-1.6.1.jar
  • geronimo-stax-api_1.0_spec-1.0.jar
  • junit-3.8.1.jar
  • log4j-1.2.13.jar
  • poi-3.6-20091214.jar
  • poi-contrib-3.6-20091214.jar
  • poi-examples-3.6-20091214.jar
  • poi-ooxml-3.6-20091214.jar
  • poi-ooxml-schemas-3.6-20091214.jar
  • poi-scratchpad-3.6-20091214.jar
  • xmlbeans-2.3.0.jar

I don't remember which ones I used, but those are the ones which were in the library when it worked.  According to the code below, you will want to put those in "lib/java_classes." However, you can put them wherever.

The code below outputs the Excel spreadsheet as an array:

Dir.new("lib/java_classes").sort.each do | file |
  require("lib/java_classes/" + file) if file =~ /\.jar$/
end

include_class Java::org.apache.poi.poifs.filesystem.POIFSFileSystem
include_class Java::org.apache.poi.poifs.filesystem.POIFSFileSystem

include_class Java::java.io.ByteArrayOutputStream
include_class Java::java.util.Date


require 'yaml'

args = YAML::load(STDIN)

input = Java::JavaIo::FileInputStream.new(args["file"])

include_class Java::org.apache.poi.ss.usermodel.WorkbookFactory
include_class Java::org.apache.poi.ss.usermodel.Workbook
include_class Java::org.apache.poi.ss.usermodel.Sheet
include_class Java::org.apache.poi.ss.usermodel.Row
include_class Java::org.apache.poi.ss.usermodel.Cell
include_class Java::org.apache.poi.ss.usermodel.FormulaEvaluator

@spreadsheet = WorkbookFactory.create(input)
@formula_evaluator = @spreadsheet.get_creation_helper.create_formula_evaluator


def parse_value(cell)
  return nil if cell.nil?

  case cell.get_cell_type
  when Cell.CELL_TYPE_NUMERIC then cell.get_numeric_cell_value
  when Cell.CELL_TYPE_STRING then cell.get_string_cell_value
  when Cell.CELL_TYPE_FORMULA then parse_formula(@formula_evaluator.evaluate(cell))
  when Cell.CELL_TYPE_BOOLEAN then cell.get_boolean_cell_value
  when Cell.CELL_TYPE_BLANK then nil
  else
    raise "Don't know what to do with cell type: #{cell.get_cell_type}"
  end
end

def parse_formula(cell_value)
  case cell_value.get_cell_type
  when Cell.CELL_TYPE_NUMERIC then cell_value.get_number_value
  when Cell.CELL_TYPE_STRING then cell_value.get_string_value
  when Cell.CELL_TYPE_BOOLEAN then cell_value.get_boolean_value
  when Cell.CELL_TYPE_BLANK then nil
  else
    raise "Don't know what to do with cell type: #{cell_value.get_cell_type}"
  end
end

spreadsheet_as_array = ([email protected]_number_of_sheets - 1).map do | sheet_num |
  sheet = @spreadsheet.sheet_at(sheet_num)
  sheet_as_array = [sheet.sheet_name, []]

  sheet.row_iterator.each do | row |
    sheet_as_array[1] << []

    row.cell_iterator.each do | cell |
      sheet_as_array[1].last << parse_value(cell)
    end
  end

  sheet_as_array
end

puts spreadsheet_as_array.inspect

There is a flaw to this code above: an Excel sheet needs to be "cleaned" prior to importing.  Excel files typically get tons of extra rows and tons of extra columns associated with the Active Cells.  People apply formatting to unused areas by selecting all, etc.  Therefore, go in and delete unused cells, and do "Save As" to get Excel to forget formatting on those cells.

To run the code above, do the following:

echo 'file: my_spreadsheet.xls' | jruby xls_to_array.rb

No comments: