Thursday, March 22, 2012

Importing Files Into MongoDB GridFS With Python

Importing files into MongoDB GridFS is a trivial task with Python. In this post I will illustrate the necessary steps to accomplish this. In subsequent entries I will describe efficient ways to query those files and associate them with other collections in a MongoDB database.

I assume you are familiar with MongoDB and GridFS. If not, I recommend you start with the GridFS documentation.

Prerequisites

The MongoDB Python driver must be present. I installed it on a Mac OS X (Lion) box with easy_install. Please note that there are other installation options.

$ easy_install pymongo

With the driver installed, there are only two concepts to illustrate: reading a file from the file system, and using the Python MongoDB driver to store it in GridFS.

Details

To open a file for reading, use the built-in open(file, mode) function, which returns a file object. Use binary mode ('rb') for non-text files such as images, so that the bytes are read unchanged.

file = open("my_file_name", 'rb')
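The read step can also be written with a "with" block, which closes the file even if reading fails. This is only a minimal sketch: read_file_bytes is a hypothetical helper name, and it assumes the file is small enough to hold in memory.

```python
def read_file_bytes(path):
    """Return the full contents of the file at `path` as bytes."""
    # "rb" reads raw bytes, which binary formats such as TIFF require.
    with open(path, "rb") as source:
        return source.read()
```

The returned bytes can be handed to GridFS as the data to store.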

Using the GridFS store in MongoDB is also fairly simple. The general steps are: (1) open a connection to the server, (2) get the target database, (3) initialize a GridFS object with the database reference, and (4) invoke the GridFS.put() function to store the file.

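Step 1 - Connect to the server using PyMongo. This illustrates connecting to your local development instance on the default port used by MongoDB. MongoClient is the client class in current PyMongo releases; older releases used pymongo.Connection, which was removed in PyMongo 3.0.

connection = pymongo.MongoClient("localhost", 27017)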

Step 2 - Obtain a reference to the database in which the file(s) will be stored using the GridFS API. Note that a MongoDB instance holds one or more databases, each with one or more collections.

db = connection.yourdatabase

Step 3 - Create a GridFS object from the reference to the database in which the file(s) will be stored.

gridFs = gridfs.GridFS(db)

All that's left now is to invoke the put() method to store the file. Its first argument is the data to store, either a byte string or a file-like object with a read() method. Any keyword arguments that follow are used by the GridIn class to assign attributes to the stored file or to specify other storage characteristics. For more details see the PyMongo documentation.

file_id = gridFs.put(file.read(), filename="my_file_name")

In this case we have passed the "filename" keyword to let GridFS know that we want the file to be stored with this file name ("my_file_name").
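Other keywords can be passed the same way. The sketch below builds a keyword dictionary for put() from a file path. The helper name put_kwargs_for and the "source" field in the usage comment are made up for illustration. "content_type" is one of the file options that GridIn accepts, and, as far as I know, any keyword it does not recognise is stored as an extra field on the file document.

```python
import mimetypes
import os


def put_kwargs_for(path, **extra):
    """Build keyword arguments for GridFS.put() from a file path."""
    kwargs = {"filename": os.path.basename(path)}
    # Guess a MIME type from the file extension; skip it if unknown.
    content_type, _encoding = mimetypes.guess_type(path)
    if content_type:
        kwargs["content_type"] = content_type
    # Caller-supplied keywords are passed through unchanged.
    kwargs.update(extra)
    return kwargs


# Hypothetical usage with an existing GridFS object:
# file_id = gridFs.put(data, **put_kwargs_for("scans/page1.tiff", source="scanner-3"))
```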

The put function returns the "_id" of the newly created file. This can be used to associate GridFS files with documents in other collections.
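Put together, the steps above might look like the sketch below. It is one possible arrangement, not a definitive implementation. The function names open_gridfs and store_file are my own, "yourdatabase" is the placeholder database name used earlier, and the connection uses pymongo.MongoClient, the client class in current PyMongo releases. The file object is passed to put() directly, since put() accepts anything with a read() method.

```python
import os


def open_gridfs(host="localhost", port=27017, db_name="yourdatabase"):
    """Connect to MongoDB and return a GridFS object for `db_name`."""
    # Imported here so store_file() can be used and tested without the driver.
    import gridfs
    import pymongo

    client = pymongo.MongoClient(host, port)  # step 1: connect
    db = client[db_name]                      # step 2: pick the database
    return gridfs.GridFS(db)                  # step 3: wrap it in GridFS


def store_file(fs, path):
    """Store the file at `path` through `fs` and return the new file's _id."""
    with open(path, "rb") as source:
        # step 4: put() reads from the file object and returns the _id
        return fs.put(source, filename=os.path.basename(path))


# Example usage (needs a running MongoDB server, so it is left commented out):
# fs = open_gridfs()
# file_id = store_file(fs, "my_file_name")
```

Keeping the connection code separate from store_file means the storing logic can be exercised with any object that offers a compatible put() method.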

Closing up

I used this approach to import a large number of TIFF files. With Python's simple and succinct syntax, along with the elegant PyMongo driver implementation, this task was accomplished with 30 lines of code, including error handling.
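The script I used is not reproduced here, but a bulk import along these lines could be sketched as follows. The function name import_directory, the "*.tif" pattern, and the choice to collect failures rather than stop at the first one are all assumptions made for illustration. The fs argument is expected to be a gridfs.GridFS object, or anything else with a compatible put() method.

```python
import glob
import os


def import_directory(fs, directory, pattern="*.tif"):
    """Store every file in `directory` matching `pattern`.

    Returns two dicts keyed by path: stored file ids and caught exceptions.
    """
    stored, failed = {}, {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        try:
            with open(path, "rb") as source:
                stored[path] = fs.put(source, filename=os.path.basename(path))
        except Exception as exc:
            # Broad on purpose: record I/O and driver errors alike, keep going.
            failed[path] = exc
    return stored, failed
```

Returning the failures lets the caller decide whether to retry, log, or abort once the whole directory has been processed.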

In subsequent posts I will describe how to associate files with existing documents in a different collection in an efficient manner.


