Using MongoDB to store and retrieve CSV files content in Ruby

There come cases, when we want to store CSV or any other sort of files data in a database. The problem occurs when the input files differs (they might have different columns). This would not be a problem if we would be able to know the files specification before parsing and inserting into DB. One of the solutions (the fastest) would be to store them in separate DB tables (each filetype / table). But what we should do, when column amount and their names are unknown? We could use SQL database, create table for data and another table for mapping appropriate DB columns to CSV columns. This might work, but it would not be an elegant solution. So what could we do?

MongoDB to the rescue!

This case is just perfect for MongoDB. MongoDB  is a scalable, high-performance, open source NoSQL database that allows us to create documents what have different attributes assigned to them. To make it work with Ruby you just need to add this to your gem file:

gem "mongoid"

Then you need to specify Mongoid yaml config file and you are ready to go (the MongoDB installation instructions can be found here):

Mongoid.load!("./mongoid.yml", :production)

MongoID example config file

Here you have really small Mongoid config file:

production:
  sessions:
    default:
      hosts:
        - localhost:27017
      database: csv_files
      username: csv
      password: csv
  options:
    allow_dynamic_fields: true
    raise_not_found_error: false
    skip_version_check: false

You can use your own config file, just remember to set allow_dynamic_fields to true!

CSV parsing

We will do a really simple CSV parsing. We will store all the values as strings, so we just need to read CSV file and create all needed attributes in objects that should represent each file row:

class StoredCSV
  include Mongoid::Document
  include Mongoid::Timestamps

  def self.import!(file_path)
    columns = []
    instances = []
    CSV.foreach(file_path) do |row|
      if columns.empty?
        # We dont want attributes with whitespaces
        columns = row.collect { |c| c.downcase.gsub(' ', '_') }
        next
      end

      instances << create!(build_attributes(row, columns))
    end
    instances
  end

  private

  def self.build_attributes(row, columns)
    attrs = {}
    columns.each_with_index do |column, index|
      attrs[column] = row[index]
    end
    attrs
  end
end

Thats all! Instead of creating SQL tables and doing some mappings – we just allow MongoDB to dynamically create all the fields that we need for given CSV file.

Usage

StoredCSV.import!('data.csv')
stored_data = StoredCSV.all

Checking attribute names for given object – don’t use Mongoid attribute_names method

You need to remember, that various instances might have (since they come from different files) different attributes, so you cannot just assume that all will have field “name”. There is a Mongoid method attribute_names, but this method will return only predefined attributes:

StoredCSV.first.attribute_names => ["_type", "_id", "created_at", "updated_at"]

To obtain all the fields for given instance you need to do something like this

StoredCSV.first.attributes.collect{|k,v| k} => ["_id", "name", "format", "description"]

Summary

This was just a simple example but this should also be a good base for a bigger and better solution. There should be implemented more complex key extracting mechanism with prefix (this would protect from reassigning protected values (like “_id”) and whole bunch of other improvements ;)

  • http://www.sharpbites.com sharpbites

    Can you elaborate on when you find it useful to store the files in the database?

  • http://www.mensfeld.pl Maciej Mensfeld

    I don’t store “pure” files – I store data that was in those files. When this is useful? In my case I’m doing some data manipulations (mappings, transformation, etc) and it is much easier to do this with data stored in DB.

  • Jared

    Why not map the the columns to ruby objects and marshalize? Maybe I just don’t understand the use-case.

  • http://www.mensfeld.pl Maciej Mensfeld

    Because doing stuff like distincts/sums/etc would be a bit of pain with Ruby objects.