Software Ramblings
The wonderful world of GEO spatial indexes in MongoDB

MongoDB has native support for geospatial indexes, along with extensions to the query language that give you many different ways of querying your geospatial documents. We will touch on all of the available features of MongoDB’s geospatial support, point by point, as outlined below.

  • Query $near a point with a maximum distance around that point
  • Set the minimum and maximum range for the 2d space letting you map any data to the space
  • GeoNear command lets you return the distance from each point found
  • $within query lets you set a shape for your query: a circle, a box or an arbitrary polygon. This lets you map complex geo queries such as congressional districts or post code zones.

But first let’s cover the basics of getting you up and running starting with what a document needs to look like for the indexing to work.

Geospatialize your documents

Somehow we need to tell MongoDB which fields represent our geospatial coordinates. Luckily for us this is very simple. Let’s take a simple sample document representing the best imaginary burger place in the world.

var document = {
  name: "Awesome burger bar"      
}

Now we need to record that it’s located on the fictitious planet Burgoria, and more specifically at the coordinates [50, 50]. So how do we add this to the document so we can look it up using geospatial searches? It’s very simple: just add it as a field, as shown below.

var document = {
  name: "Awesome burger bar",
  loc: [50, 50]      
}

Easy right? The only thing you have to ensure is that the first coordinate is the x coordinate and the second one is the y coordinate [x, y].

Let’s go ahead and connect to the database and insert the document

var Db = require('mongodb').Db;

var document = {
  name: "Awesome burger bar",
  loc: [50, 50]      
}

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)

  db.collection('places').insert(document, {w:1}, function(err, result) {
    if(err) return console.dir(err)
  });
});

So now we have a document in our collection. Next we need to tell MongoDB to index our collection by creating a 2d index on the loc attribute, so we can avail ourselves of the awesome geospatial features. This turns out to be easy as well. Let’s modify the code to ensure we have the index on startup.

var Db = require('mongodb').Db;

var document = {
  name: "Awesome burger bar",
  loc: [50, 50]      
}

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)
  var collection = db.collection('places');

  collection.ensureIndex({loc: "2d"}, {min: -500, max: 500, w:1}, function(err, result) {
    if(err) return console.dir(err);

    collection.insert(document, {w:1}, function(err, result) {
      if(err) return console.dir(err)
    });
  });
});

ensureIndex does the trick, creating the index if it does not already exist. By specifying {loc: "2d"}, MongoDB will index the array stored under the field name loc in every document. The min and max options define the boundaries of our planet Burgoria, which means that inserting a point outside the range -500 to 500 will produce an error, as it’s not on the planet.

Basic queries for your geospatial documents

Since we now have a geospatial index on our collection let’s play around with the query methods and learn how we can work with the data. First however let’s add some more documents so we can see the effects of the different boundaries.

var Db = require('mongodb').Db;

var documents = [
    {name: "Awesome burger bar", loc: [50, 50]}
  , {name: "Not an Awesome burger bar", loc: [10, 10]}
  , {name: "More or less an Awesome burger bar", loc: [45, 45]}
]

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)
  var collection = db.collection('places');

  collection.ensureIndex({loc: "2d"}, {min: -500, max: 500, w:1}, function(err, result) {
    if(err) return console.dir(err);

    collection.insert(documents, {w:1}, function(err, result) {
      if(err) return console.dir(err)
    });
  });
});

Right, from now on, for brevity’s sake, we are going to assume the documents are stored in the collection and the index has been created, so we can work on queries without the boilerplate insert and index creation code. The first thing we are going to do is locate all the documents within a distance of 10 from [50, 50].

var Db = require('mongodb').Db,
  assert = require('assert');

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)

  db.collection('places').find({loc: {$near: [50,50], $maxDistance: 10}}).toArray(function(err, docs) {
    if(err) return console.dir(err)

    assert.equal(docs.length, 2);
  });
});

This returns the following results (ignore the _id it will be different as it’s a collection assigned key).

{ "_id" : 509a47337d6ab61b2871ee8e, "name" : "Awesome burger bar", "loc" : [ 50, 50 ] }
{ "_id" : 509a47337d6ab61b2871ee90, "name" : "More or less an Awesome burger bar", "loc" : [ 45, 45 ] }

Let’s look at the query. $near specifies the center point for the geospatial query and $maxDistance the radius of the search circle. Given this, the query returns the two documents at [50, 50] and [45, 45]; the one at [10, 10] is too far away. Now this is a nice feature, but what if we need to know the distance from each of the found documents to the center point of our query? Luckily there is a command that supports this, called geoNear. Let’s execute it and look at the results.

var Db = require('mongodb').Db,
  assert = require('assert');

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)

  db.collection('places').geoNear(50, 50, {maxDistance:10}, function(err, result) {
    if(err) return console.dir(err)

    assert.equal(result.results.length, 2);
  });
});

Let’s look at the results returned by the query.

{
  "ns" : "test.places",
  "near" : "1100000011110000111100001111000011110000111100001111",
  "results" : [
    {
      "dis" : 0,
      "obj" : {
        "_id" : 509a47337d6ab61b2871ee8e,
        "name" : "Awesome burger bar",
        "loc" : [
          50,
          50
        ]
      }
    },
    {
      "dis" : 7.0710678118654755,
      "obj" : {
        "_id" : 509a47337d6ab61b2871ee90,
        "name" : "More or less an Awesome burger bar",
        "loc" : [
          45,
          45
        ]
      }
    }
  ],
  "stats" : {
    "time" : 0,
    "btreelocs" : 0,
    "nscanned" : 2,
    "objectsLoaded" : 2,
    "avgDistance" : 3.5355339059327378,
    "maxDistance" : 7.071128503792992
  },
  "ok" : 1
}

Notice that geoNear is a command, not a find query, so it returns a single document with the matches in its results field. As we can see, each returned result has a field called dis that is the distance of the document from the center point of our search. Cool, we’ve now covered the basics of geospatial search, so let’s move on to more advanced queries.
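
To make the dis values concrete, here is a minimal sketch in plain JavaScript, with no database involved, that applies a straight-line distance to the three sample documents. The helper names distance and near are made up for this illustration. It only mirrors what the query computes on a flat plane and says nothing about how MongoDB implements the index.

```javascript
// The three sample documents from the examples above
var documents = [
    {name: "Awesome burger bar", loc: [50, 50]}
  , {name: "Not an Awesome burger bar", loc: [10, 10]}
  , {name: "More or less an Awesome burger bar", loc: [45, 45]}
];

// Straight-line (Euclidean) distance between two [x, y] points
function distance(a, b) {
  var dx = a[0] - b[0], dy = a[1] - b[1];
  return Math.sqrt(dx * dx + dy * dy);
}

// Hypothetical helper: keep the documents within maxDistance of center,
// closest first, using the same {dis, obj} shape as the geoNear results
function near(docs, center, maxDistance) {
  return docs
    .map(function(doc) { return {dis: distance(doc.loc, center), obj: doc}; })
    .filter(function(result) { return result.dis <= maxDistance; })
    .sort(function(a, b) { return a.dis - b.dis; });
}

var results = near(documents, [50, 50], 10);
results.forEach(function(result) {
  console.log(result.dis, result.obj.name);
});
```

The bar at [45, 45] is the square root of 50 (about 7.07) away from the center, inside the radius of 10, while the one at [10, 10] is about 56.6 away and drops out.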

Advanced queries for your geospatial documents

Besides these simple queries we can also do bounds queries. By bounds queries we mean looking for points of interest inside a defined boundary. This can be useful if you have such things as a post code area, a congressional district or any other sort of boundary that is not a pure circle (say, looking for all restaurants in the West Village in New York). Let’s go through the basics.

The magical boundary box query

Our country Whopper on Burgoria is a perfectly bound box (imagine that). Our application wants to restrict our searches to only burger bars in Whopper. The box is defined by two opposite corners, the bottom left at (30, 30) and the top right at (60, 60). Great, let’s perform a box bounded query.

var Db = require('mongodb').Db,
  assert = require('assert');

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)
  var box = [[30, 30], [60, 60]];

  db.collection('places').find({loc: {$within: {$box: box}}}).toArray(function(err, docs) {
    if(err) return console.dir(err)

    assert.equal(docs.length, 2);
  });
});

The results returned are.

{ "_id" : 509a47337d6ab61b2871ee8e, "name" : "Awesome burger bar", "loc" : [ 50, 50 ] }
{ "_id" : 509a47337d6ab61b2871ee90, "name" : "More or less an Awesome burger bar", "loc" : [ 45, 45 ] }
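
The box test itself is easy to picture. Below is a small sketch in plain JavaScript that runs the same check against the sample documents in memory. The insideBox helper is hypothetical, and treating points on the edge of the box as inside is an assumption of this sketch rather than a statement about MongoDB’s behaviour.

```javascript
// The three sample documents from the examples above
var documents = [
    {name: "Awesome burger bar", loc: [50, 50]}
  , {name: "Not an Awesome burger bar", loc: [10, 10]}
  , {name: "More or less an Awesome burger bar", loc: [45, 45]}
];

// Hypothetical helper: true when point lies inside the box given as
// [bottomLeft, topRight]. Points on the edge count as inside here.
function insideBox(point, box) {
  var bottomLeft = box[0], topRight = box[1];
  return point[0] >= bottomLeft[0] && point[0] <= topRight[0]
      && point[1] >= bottomLeft[1] && point[1] <= topRight[1];
}

var box = [[30, 30], [60, 60]];
var inBox = documents.filter(function(doc) {
  return insideBox(doc.loc, box);
});

console.log(inBox.map(function(doc) { return doc.name; }));
```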

A polygon too far

Awesome, we can now query by our perfectly boxed country. Inside Whopper the country is split into triangles, where triangle one is made up of the three points (40, 40), (40, 50) and (45, 45). We want to look for points that are only inside this triangle. Let’s have a look at the query.

var Db = require('mongodb').Db,
  assert = require('assert');

Db.connect("mongodb://localhost:27017/geodb", function(err, db) {
  if(err) return console.dir(err)
  var triangle = [[40, 40], [40, 50], [45, 45]];

  db.collection('places').find({loc: {$within: {$polygon: triangle}}}).toArray(function(err, docs) {
    if(err) return console.dir(err)

    assert.equal(docs.length, 1);
  });
});

The results returned are.

{ "_id" : ObjectId("509a47337d6ab61b2871ee90"), "name" : "More or less an Awesome burger bar", "loc" : [ 45, 45 ] }
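
For the curious, a polygon test can be sketched in plain JavaScript with the classic ray casting approach. This is an illustration only, not MongoDB’s implementation. The helper names are invented, and counting points that sit on an edge or a vertex as inside is an assumption, chosen so that the sketch agrees with the result above, where [45, 45] is itself a corner of the triangle.

```javascript
// The three sample documents from the examples above
var documents = [
    {name: "Awesome burger bar", loc: [50, 50]}
  , {name: "Not an Awesome burger bar", loc: [10, 10]}
  , {name: "More or less an Awesome burger bar", loc: [45, 45]}
];

// True when point p lies exactly on the segment from a to b
function onSegment(p, a, b) {
  var cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]);
  if (cross !== 0) return false;
  return p[0] >= Math.min(a[0], b[0]) && p[0] <= Math.max(a[0], b[0])
      && p[1] >= Math.min(a[1], b[1]) && p[1] <= Math.max(a[1], b[1]);
}

// Ray casting: count how many edges a horizontal ray starting at the
// point crosses. An odd count means the point is inside.
// Assumption of this sketch: edge and vertex points count as inside.
function insidePolygon(point, polygon) {
  var inside = false;
  for (var i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
    var a = polygon[i], b = polygon[j];
    if (onSegment(point, a, b)) return true;
    // Does this edge span the height of the point?
    var straddles = (a[1] > point[1]) !== (b[1] > point[1]);
    if (straddles) {
      // x coordinate where the edge crosses the height of the point
      var xAtRay = (b[0] - a[0]) * (point[1] - a[1]) / (b[1] - a[1]) + a[0];
      if (point[0] < xAtRay) inside = !inside;
    }
  }
  return inside;
}

var triangle = [[40, 40], [40, 50], [45, 45]];
var inTriangle = documents.filter(function(doc) {
  return insidePolygon(doc.loc, triangle);
});

console.log(inTriangle.map(function(doc) { return doc.name; }));
```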

One cool thing you can use this for: with data such as that at https://nycopendata.socrata.com/browse?tags=geographic you can create queries that slice New York into areas and look for data points inside those areas. So we’ve seen how we can query geospatially in a lot of different ways. In closing we want to mention some simple ideas to get your mind churning.

Geospatial interesting tidbits

We mostly promote these features as geospatial, but at some point you’ll realize that it’s a generic 2d index that can be used to index any x,y data. You could consider indexing any data points that fit into a 2d space and using the geo query functionality to retrieve subsets of that data. Say you map price against apartment size; given an apartment, you can then ask for everything that is “close” to the ideal price and size you’re looking for. The only limit here is your imagination, and as you can see it’s a pretty general and very powerful feature once you stop looking at it as a purely geographical function. With that I leave you to experiment and have fun with the features we have introduced.

1.1.10 of the driver released, important notice attached

List of changes in the driver

One important thing: it will now print a warning to the console unless you set the “safe” mode of the driver

new Db('exampleDb', new Server("localhost", 27017), {safe:true})

http://mongodb.github.com/node-mongodb-native/api-generated/db.html#constructor

This is due to a planned change in the near future from non-safe writes as the default to safe writes. The warning is there so you make a conscious decision about which write concern should be the default for all operations in your code. More about write concerns at:

http://www.mongodb.org/display/DOCS/getLastError_old

TCP keepalive

One thing that comes up quite frequently as a question when using the mongodb node.js driver is a socket that stops responding. This usually has one of two causes.

  1. There is a firewall in between the application and the mongodb instance and it does not observe keepAlive.
  2. The socket timeout is too high on your system, leaving the socket hanging and never closing.

The first situation can be remedied by setting the socket connection options: enable keepAlive and set a hard timeout value on the socket. This ensures that a correctly configured firewall will keep the connection alive, and if it does not, the socket will time out.

The other thing to tweak is the OS tcp_keepalive_time. Basically it’s too high for something like MongoDB (the default is 2 hours on Linux). Setting this lower will correctly time out dead sockets and let the driver recover.
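
At the socket level, the first remedy looks roughly like the sketch below, written against Node’s own net module rather than the driver. The configureSocket helper and the two millisecond values are placeholders invented for this example. The driver’s own option names for these settings are not shown here, so check its documentation for them.

```javascript
var net = require('net');

// Hypothetical helper: enable TCP keepalive and a hard inactivity
// timeout on a plain Node socket
function configureSocket(socket, options) {
  // Ask the OS to start sending keepalive probes once the connection
  // has been idle for keepAliveDelayMS
  socket.setKeepAlive(true, options.keepAliveDelayMS);

  // Emit a 'timeout' event after timeoutMS without any activity
  socket.setTimeout(options.timeoutMS);

  // The 'timeout' event is only a notification, so we destroy the
  // socket ourselves to make sure a dead connection does not linger
  socket.on('timeout', function() {
    socket.destroy();
  });

  return socket;
}

// Placeholder values, to be tuned for your own network
var socket = configureSocket(new net.Socket(), {
  keepAliveDelayMS: 1000,
  timeoutMS: 30000
});
```

The settings are applied before connecting here, which Node allows. The same calls work on an already connected socket.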

A good link to read more about it.

http://www.mongodb.org/display/DOCS/Troubleshooting#Troubleshooting-Socketerrorsinshardedclustersandreplicasets

New features in the driver for MongoDB 2.2

Mongo Driver and Mongo DB 2.2 Features

For Mongo DB 2.2 there are multiple new features and improvements in the driver. These include Mongos failover support, authentication, replicaset support, read preferences and aggregation. Let’s move through the different new features, starting with read preferences.

Read preferences

Read preferences are now backed by a specification and are more consistent across drivers. With read preferences you can control where your reads happen in a replicaset and, from Mongo DB 2.2, also in a sharded setup. Let’s go through the different read preferences that are available and what they mean.

  • ReadPreference.PRIMARY: Read from primary only. All operations produce an error (throw an exception where applicable) if primary is unavailable. Cannot be combined with tags (This is the default.)
  • ReadPreference.PRIMARY_PREFERRED: Read from primary if available, otherwise a secondary.
  • ReadPreference.SECONDARY: Read from secondary if available, otherwise error.
  • ReadPreference.SECONDARY_PREFERRED: Read from a secondary if available, otherwise read from the primary.
  • ReadPreference.NEAREST: All modes read from among the nearest candidates, but unlike other modes, NEAREST will include both the primary and all secondaries in the random selection. The name NEAREST is chosen to emphasize its use, when latency is most important. For I/O-bound users who want to distribute reads across all members evenly regardless of ping time, set secondaryAcceptableLatencyMS very high. See “Ping Times” below. A strategy must be enabled on the ReplSet instance to use NEAREST as it requires intermittent setTimeout events, see Db class documentation
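
The fallback rules in the list above can be summarised in a few lines of plain JavaScript. This is a sketch of the rules as described, not the driver’s selection code. The mode constants are local stand-ins for the ReadPreference values, the member objects and their state field are invented for the example, and ping times and tags are ignored entirely.

```javascript
// Local stand-ins for the ReadPreference constants
var PRIMARY = 'primary', PRIMARY_PREFERRED = 'primaryPreferred',
    SECONDARY = 'secondary', SECONDARY_PREFERRED = 'secondaryPreferred',
    NEAREST = 'nearest';

// Hypothetical helper: which members may serve a read for a given mode
function eligibleMembers(mode, members) {
  var primaries = members.filter(function(m) { return m.state === 'primary'; });
  var secondaries = members.filter(function(m) { return m.state === 'secondary'; });

  switch (mode) {
    case PRIMARY:
      if (primaries.length === 0) throw new Error("no primary available");
      return primaries;
    case PRIMARY_PREFERRED:
      return primaries.length > 0 ? primaries : secondaries;
    case SECONDARY:
      if (secondaries.length === 0) throw new Error("no secondary available");
      return secondaries;
    case SECONDARY_PREFERRED:
      return secondaries.length > 0 ? secondaries : primaries;
    case NEAREST:
      // Primary and secondaries compete on equal terms
      return primaries.concat(secondaries);
    default:
      throw new Error("unknown read preference: " + mode);
  }
}

var members = [
  {host: "localhost:27017", state: "primary"},
  {host: "localhost:27018", state: "secondary"},
  {host: "localhost:27019", state: "secondary"}
];

console.log(eligibleMembers(SECONDARY_PREFERRED, members));
console.log(eligibleMembers(SECONDARY_PREFERRED, [members[0]]));
```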

Additionally you can now use tags with the read preferences (other than PRIMARY) to actively choose specific sets of servers in a replicaset or sharded system located in different data centers. The rules are fairly simple, as outlined below. A server member matches a tag set if its tags match all the tags in the set. For example, a member tagged { dc: 'ny', rack: 2, size: 'large' } matches the tag set { dc: 'ny', rack: 2 }. A member’s extra tags don’t affect whether it’s a match.
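
The matching rule fits in a few lines of plain JavaScript. The function name matchesTagSet is made up for this sketch; it only restates the rule above and is not taken from the driver.

```javascript
// Hypothetical helper: a member matches when every tag in the tag set
// is present on the member with the same value
function matchesTagSet(memberTags, tagSet) {
  return Object.keys(tagSet).every(function(key) {
    return memberTags[key] === tagSet[key];
  });
}

var member = {dc: 'ny', rack: 2, size: 'large'};

console.log(matchesTagSet(member, {dc: 'ny', rack: 2}));  // true, the extra size tag is ignored
console.log(matchesTagSet(member, {dc: 'sf'}));           // false
```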

Using a read preference is very simple. Below are some examples using it at the db level, collection level and individual query level as well as an example using tags.

Below is a simple example using readpreferences at the db level.

var mongo = require('mongodb'),
  Server = mongo.Server,
  ReplSet = mongo.ReplSet,
  ReadPreference = mongo.ReadPreference,
  Db = mongo.Db;

// Replica configuration
var replSet = new ReplSet( [
    new Server( "localhost", 27017),
    new Server( "localhost", 27018),
    new Server( "localhost", 27019)
  ], {rs_name: "foo"}
);

// Instantiate a new db object
var db = new Db('exampleDb', replSet, {readPreference: ReadPreference.SECONDARY_PREFERRED});
db.open(function(err, db) {
  if(!err) {
    console.log("We are connected");
  }
});

Below is a simple example using readpreferences at the collection level.

var mongo = require('mongodb'),
  Server = mongo.Server,
  ReplSet = mongo.ReplSet,
  ReadPreference = mongo.ReadPreference,
  Db = mongo.Db;

// Replica configuration
var replSet = new ReplSet( [
    new Server( "localhost", 27017),
    new Server( "localhost", 27018),
    new Server( "localhost", 27019)
  ], {rs_name: "foo"}
);

// Instantiate a new db object
var db = new Db('exampleDb', replSet);
db.open(function(err, db) {
  if(!err) {
    console.log("We are connected");

    var collection = db.collection('somecollection', {readPreference: ReadPreference.SECONDARY_PREFERRED});
    collection.find({}).toArray(function(err, items) {
      // Done reading from secondary if available
    })
  }
});

Below is a simple example using readpreferences at the query level.

var mongo = require('mongodb'),
  Server = mongo.Server,
  ReplSet = mongo.ReplSet,
  ReadPreference = mongo.ReadPreference,
  Db = mongo.Db;

// Replica configuration
var replSet = new ReplSet( [
    new Server( "localhost", 27017),
    new Server( "localhost", 27018),
    new Server( "localhost", 27019)
  ], {rs_name: "foo"}
);

// Instantiate a new db object
var db = new Db('exampleDb', replSet);
db.open(function(err, db) {
  if(!err) {
    console.log("We are connected");

    var collection = db.collection('somecollection');
    collection.find({}).setReadPreference(new ReadPreference(ReadPreference.SECONDARY_PREFERRED)).toArray(function(err, items) {
      // Done reading from secondary if available
    })
  }
});

Below is a simple example using a readpreference with tags at the query level. This example will pick from the set of servers tagged with dc1:ny.

var mongo = require('mongodb'),
  Server = mongo.Server,
  ReplSet = mongo.ReplSet,
  ReadPreference = mongo.ReadPreference,
  Db = mongo.Db;

// Replica configuration
var replSet = new ReplSet( [
    new Server( "localhost", 27017),
    new Server( "localhost", 27018),
    new Server( "localhost", 27019)
  ], {rs_name: "foo"}
);

// Instantiate a new db object
var db = new Db('exampleDb', replSet);
db.open(function(err, db) {
  if(!err) {
    console.log("We are connected");

    var collection = db.collection('somecollection');
    collection.find({}).setReadPreference(new ReadPreference(ReadPreference.SECONDARY_PREFERRED, {"dc1":"ny"})).toArray(function(err, items) {
      // Done reading from secondary if available
    })
  }
});

Mongos

There is now a separate Server type for Mongos that handles not only Mongos read preferences for Mongo DB 2.2 but also failover and picking the nearest Mongos proxy to your application. To use it, simply do the following.

var mongo = require('mongodb'),
  Server = mongo.Server,
  Mongos = mongo.Mongos,
  Db = mongo.Db;

// Set up mongos connection
var mongos = new Mongos([
    new Server("localhost", 50000, { auto_reconnect: true }),
    new Server("localhost", 50001, { auto_reconnect: true })
  ])

// Instantiate a new db object
var db = new Db('exampleDb', mongos);
db.open(function(err, db) {
  if(!err) {
    console.log("We are connected");
  }

  db.close();
});

Read preferences also work with Mongos from Mongo DB 2.2 or higher allowing you to create more complex deployment setups.

Aggregation framework helper

The MongoDB aggregation framework provides a means to calculate aggregate values without having to use map-reduce. While map-reduce is powerful, using map-reduce is more difficult than necessary for many simple aggregation tasks, such as totaling or averaging field values.

The driver supports the aggregation framework by adding a helper at the collection level to execute an aggregation pipeline against the documents in that collection. Below is a simple example of using the aggregation framework to perform a group by tags.

var mongo = require('mongodb'),
  Server = mongo.Server,
  Db = mongo.Db;

// Some docs for insertion
var docs = [{
    title : "this is my title", author : "bob", posted : new Date() ,
    pageViews : 5, tags : [ "fun" , "good" , "fun" ], other : { foo : 5 },
    comments : [
      { author :"joe", text : "this is cool" }, { author :"sam", text : "this is bad" }
    ]}];

var db = new Db('exampleDb', new Server('localhost', 27017));
db.open(function(err, db) {
  // Create a collection
  db.createCollection('test', function(err, collection) {
    // Insert the docs
    collection.insert(docs, {safe:true}, function(err, result) {

      // Execute aggregate, notice the pipeline is expressed as an Array
      collection.aggregate([
          { $project : {
            author : 1,
            tags : 1
          }},
          { $unwind : "$tags" },
          { $group : {
            _id : {tags : "$tags"},
            authors : { $addToSet : "$author" }
          }}
        ], function(err, result) {
          console.dir(result);
          db.close();
      });
    });
  });
});
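
To see what each stage of that pipeline does, the sketch below reproduces the three stages by hand in plain JavaScript on a trimmed copy of the sample document. It illustrates the semantics only. The server does this work for you, the default _id handling of $project is left out, and the order in which groups come back from a real aggregation is not guaranteed to match.

```javascript
// Trimmed copy of the sample document used above
var docs = [{
  title: "this is my title", author: "bob",
  tags: ["fun", "good", "fun"]
}];

// $project: keep only the author and tags fields
var projected = docs.map(function(doc) {
  return {author: doc.author, tags: doc.tags};
});

// $unwind: emit one document per element of the tags array
var unwound = [];
projected.forEach(function(doc) {
  doc.tags.forEach(function(tag) {
    unwound.push({author: doc.author, tags: tag});
  });
});

// $group with $addToSet: one group per tag, each author listed once
var groups = {};
unwound.forEach(function(doc) {
  var group = groups[doc.tags];
  if (!group) {
    group = groups[doc.tags] = {_id: {tags: doc.tags}, authors: []};
  }
  if (group.authors.indexOf(doc.author) === -1) {
    group.authors.push(doc.author);
  }
});

var result = Object.keys(groups).map(function(key) { return groups[key]; });
console.dir(result);
```

Although "fun" appears twice in the tags array and so produces two unwound documents, $addToSet keeps "bob" only once in that group.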

Replicaset improvements and changes

Replicasets now return to the driver as soon as a primary has been identified. This allows for faster connect times, since the application does not have to wait for the whole set to be identified before being able to run. That said, any queries using the read preference ReadPreference.SECONDARY might fail until at least one secondary is up. To aid in the development of layers above the driver, it now emits two new events.

  • open is emitted when the driver is ready to be used.
  • fullsetup is emitted once the whole replicaset is up and running

To ensure better control over timeouts when attempting to connect to replicaset members that might be down, there are now two timeout settings.

  • connectTimeoutMS: set the timeout for the initial connect to the mongod or mongos instance.
  • socketTimeoutMS: set the timeout for established connections to the mongod or mongos instance.

High availability “on” by default

The high availability code has been rewritten to run outside a setTimeout allowing for better control and handling. It’s also on by default now. It can be disabled using the following settings on the ReplSet class.

  • ha {Boolean, default:true}, turn on high availability.
  • haInterval {Number, default:2000}, time between each replicaset status check.

This allows the driver to discover new replicaset members or replicaset members who left the set and then returned.

Better stream support for GridFS

GridFS now supports the streaming APIs for node, allowing you to pipe content either into or out of a GridStore object and making it easy to work with the other streaming APIs available.

A simple example is shown below for how to stream from a file on disk to a gridstore object.

var mongo = require('mongodb'),
  fs = require('fs'),
  Server = mongo.Server,
  GridStore = mongo.GridStore,
  Db = mongo.Db;

var db = new Db('exampleDb', new Server("localhost", 27017, {auto_reconnect:true}));
db.open(function(err, client) {
  // Set up gridStore
  var gridStore = new GridStore(client, "test_stream_write", "w");
  // Create a file reader stream to an object
  var fileStream = fs.createReadStream("./test/gridstore/test_gs_working_field_read.pdf");
  gridStore.on("close", function(err) {
    // Just read the content and compare to the raw binary
    GridStore.read(client, "test_stream_write", function(err, gridData) {
      var fileData = fs.readFileSync("./test/gridstore/test_gs_working_field_read.pdf");
      // Compare the stored bytes with the file on disk, then close the connection
      require('assert').deepEqual(fileData, gridData);
      client.close();
    })
  });

  // Pipe it through to the gridStore
  fileStream.pipe(gridStore);
})

A simple example is shown below for how to stream from a gridfs file to a file on disk.

var mongo = require('mongodb'),
  fs = require('fs'),
  Server = mongo.Server,
  GridStore = mongo.GridStore,
  Db = mongo.Db;

var db = new Db('exampleDb', new Server("localhost", 27017, {auto_reconnect:true}));
db.open(function(err, client) {
  // Set up gridStore
  var gridStore = new GridStore(client, "test_stream_write_2", "w");
  gridStore.writeFile("./test/gridstore/test_gs_working_field_read.pdf", function(err, result) {
    // Open a readable gridStore
    gridStore = new GridStore(client, "test_stream_write_2", "r");
    // Create a file write stream
    var fileStream = fs.createWriteStream("./test_stream_write_2.tmp");
    fileStream.on("close", function(err) {
      // Read the temp file and compare
      var compareData = fs.readFileSync("./test_stream_write_2.tmp");
      var originalData = fs.readFileSync("./test/gridstore/test_gs_working_field_read.pdf");
      // Compare the streamed copy with the original, then close the connection
      require('assert').deepEqual(originalData, compareData);
      client.close();
    })
    // Pipe out the data
    gridStore.pipe(fileStream);
  });
})

toBSON method

If an object has a toBSON function, it will now be called for custom serialization of the object instance. This can be used to serialize only the wanted fields. Deserialization is not affected by this, and the application is responsible for turning the stored documents back into objects.

A simple example below

var customObject = {
    a:1,
    b:2,
    toBSON: function() {
      return {a:this.a}
    }
  }
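
The effect is similar to how JSON.stringify honours a toJSON method. The sketch below shows the idea in plain JavaScript. The valueToSerialize function is a hypothetical stand-in for the check a serializer performs before encoding an object; it is not the driver’s code.

```javascript
// Hypothetical stand-in for the step a serializer takes before encoding:
// if the object offers toBSON, serialize what it returns instead
function valueToSerialize(object) {
  if (object != null && typeof object.toBSON === 'function') {
    return object.toBSON();
  }
  return object;
}

var customObject = {
  a: 1,
  b: 2,
  toBSON: function() {
    return {a: this.a};
  }
};

console.dir(valueToSerialize(customObject));   // only the a field is left
console.dir(valueToSerialize({a: 1, b: 2}));   // no toBSON, passed through unchanged
```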

Much faster BSON C++ parser

Thanks to the awesome people at Lucasfilm Singapore we have a new BSON C++ serializer/deserializer that performs on average 40-50% faster than the current implementation.

Other minor changes

  • Connection pool is now set to 5 by default. Override if there is need for either a bigger or smaller pool per node process.
  • Gridfs now ensures an index on the chunks collection on file_id.

Node knockout tutorial 2, A primer for GridFS using the Mongo DB driver

A primer for GridFS using the Mongo DB driver

In the first tutorial we targeted general usage of the database. But Mongo DB is much more than this. One of its additional, very useful features is the ability to act as a file storage system. This is accomplished in Mongo DB by having a files collection and a chunks collection, where each document in the chunks collection makes up a block of the file. In this tutorial we will look at how to use the GridFS functionality and what functions are available.
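
To picture how a file turns into chunk documents, here is a small sketch in plain JavaScript that cuts a Buffer into pieces. The splitIntoChunks helper, the string file id and the tiny chunk size of 4 bytes are all made up for illustration. The field names files_id, n and data follow the GridFS chunk layout, but the driver does all of this for you.

```javascript
// Hypothetical helper: cut a Buffer into chunk documents of at most
// chunkSize bytes, numbered in order by the n field
function splitIntoChunks(fileId, buffer, chunkSize) {
  var chunks = [];
  for (var offset = 0, n = 0; offset < buffer.length; offset += chunkSize, n++) {
    chunks.push({
      files_id: fileId,
      n: n,
      data: buffer.subarray(offset, offset + chunkSize)
    });
  }
  return chunks;
}

var buffer = Buffer.from("Hello world");
var chunks = splitIntoChunks("example-file-id", buffer, 4);

console.log(chunks.length);

// Joining the pieces in order of n gives back the original content
var joined = Buffer.concat(chunks.map(function(chunk) { return chunk.data; }));
console.log(joined.toString());
```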

A simple example

Let’s dive straight into a simple example on how to write a file to the grid using the simplified Grid class.

var mongo = require('mongodb'),
  Server = mongo.Server,
  Db = mongo.Db,
  Grid = mongo.Grid;

var server = new Server('localhost', 27017, {auto_reconnect: true});
var db = new Db('exampleDb', server);

db.open(function(err, db) {
  if(!err) {
    var grid = new Grid(db, 'fs');    
    var buffer = new Buffer("Hello world");
    grid.put(buffer, {metadata:{category:'text'}, content_type: 'text'}, function(err, fileInfo) {
      if(!err) {
        console.log("Finished writing file to Mongo");
      }
    });
  }
});

All right let’s dissect the example. The first thing you’ll notice is the statement

var grid = new Grid(db, 'fs');

Since GridFS is actually a special structure stored as collections, you’ll notice that we are using the same db connection that we used in the previous tutorial to operate on collections and documents. The second parameter 'fs' lets you change the collections the data is stored in. In this example the collections would be fs.files and fs.chunks.

Having a live grid instance, we now go ahead and create some test data stored in a Buffer instance, although you can pass in a string instead. We then write our data to Mongo DB.

var buffer = new Buffer("Hello world");
grid.put(buffer, {metadata:{category:'text'}, content_type: 'text'}, function(err, fileInfo) {
  if(!err) {
    console.log("Finished writing file to Mongo");
  }
});

Let’s deconstruct the call we just made. The put call will write the data you passed in as one or more chunks. The second parameter is a hash of options for the Grid class. In this case we wish to annotate the file we are writing to Mongo DB with some metadata and also specify a content type. Each file entry in GridFS has support for metadata documents, which might be very useful if you are, for example, storing images in your Mongo DB and need to store all the data associated with each image.

One important thing to take note of is that the put method returns a document containing an _id. This is an ObjectID identifier that you’ll need if you wish to retrieve the file contents later.

Right, so we have written our first file. Let’s look at the other two simple functions supported by the Grid class.

// The requires and other initialization code are omitted for brevity

db.open(function(err, db) {
  if(!err) {
    var grid = new Grid(db, 'fs');    
    var buffer = new Buffer("Hello world");
    grid.put(buffer, {metadata:{category:'text'}, content_type: 'text'}, function(err, fileInfo) {
      grid.get(fileInfo._id, function(err, data) {
        console.log("Retrieved data: " + data.toString());
        grid.delete(fileInfo._id, function(err, result) {
        });        
      });
    });
  }
});

Let’s have a look at the two operations get and delete

grid.get(fileInfo._id, function(err, data) {});

The get method takes an ObjectID as the first argument, and as we can see in the code we are using the one provided in fileInfo._id. This will read all the chunks for the file and return them as a Buffer object.

The delete method also takes an ObjectID as the first argument but will delete the file entry and the chunks associated with the file in Mongo.

This API is the simplest one you can use to interact with GridFS, but it’s not suitable for all kinds of files. One of its main drawbacks shows up when you are trying to write large files to Mongo DB. This API requires you to hold the entire file in memory when writing to and reading from Mongo DB, which most likely is not feasible if you have to store large files like video or RAW pictures. Luckily this is not the only way to work with GridFS. That’s not to say this API is not useful. If you are storing tons of small files, the simplicity might be a worthwhile tradeoff for the memory usage. Let’s dive into some of the more advanced ways of using GridFS.

Advanced GridFS or how not to run out of memory

As we just said, controlling memory consumption for your file writing and reading is key if you want to scale up the application. That means not reading in entire files before either writing to or reading from Mongo DB. The good news is that this is supported. Let’s throw some code out there straight away and look at how to do chunk sized streaming writes and reads.

// The requires and other initialization code are omitted for brevity

var fileId = new ObjectID();
var gridStore = new GridStore(db, fileId, "w", {root:'fs'});
gridStore.chunkSize = 1024 * 256;

gridStore.open(function(err, gridStore) {
 Step(
   function writeData() {
     var group = this.group();

     for(var i = 0; i < 1000000; i += 5000) {
       gridStore.write(new Buffer(5000), group());
     }   
   },

   function doneWithWrite() {
     gridStore.close(function(err, result) {
       console.log("File has been written to GridFS");
     });
   }
 )
});

Before we jump into picking apart the code let’s look at

var gridStore = new GridStore(db, fileId, "w", {root:'fs'});

Notice the parameter “w”; this is important. It tells the driver that you are planning to write a new file. The modes you can use here are:

  • “r” – read only. This is the default mode
  • “w” – write in truncate mode. Existing data will be overwritten
  • “w+” – write in edit mode

Right, so there is a fair bit to digest here. We are simulating writing a file that’s about 1MB big to Mongo DB using GridFS. To do this we are writing it in chunks of 5000 bytes. To avoid a difficult callback setup we are using the Step library with its group functionality to ensure that we are notified when all of the writes are done. After all the writes are done, Step will invoke the next function (or step), called doneWithWrite, where we finish up by closing the file, which flushes out any remaining data to Mongo DB and updates the file document.
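
If you are wondering what a helper like group has to do, the core idea is a counter that fires a final callback once every write has reported back. The sketch below shows that idea in plain JavaScript. The whenAll function is invented for this illustration and is not Step’s implementation. It also needs the number of callbacks up front, which Step’s group does not, and the writes are simulated by calling the callback directly.

```javascript
// Hypothetical helper: returns a callback to hand to every operation.
// Once it has been called count times, done runs with the first error
// (if any) and the collected results.
function whenAll(count, done) {
  var results = [], firstError = null, remaining = count;
  return function(err, result) {
    if (err && !firstError) firstError = err;
    results.push(result);
    if (--remaining === 0) done(firstError, results);
  };
}

// 200 writes of 5000 bytes each, as in the example above
var writes = 1000000 / 5000;
var finished = false;

var callback = whenAll(writes, function(err, results) {
  finished = true;
  console.log("All " + results.length + " writes are done");
});

// Simulated writes: each one "completes" by calling the shared callback
for (var i = 0; i < writes; i++) {
  callback(null, 5000);
}
```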

As we are doing it in chunks of 5000 bytes we will notice that memory consumption is low. This is the trick to write large files to GridFS. In pieces. Also notice this line.

gridStore.chunkSize = 1024 * 256;

This allows you to adjust how big the chunks that Mongo DB will write are, in bytes. You can tune the chunk size to your needs. If you need to write large files to GridFS it might be worthwhile to trade off memory for CPU by setting a larger chunk size.

Now let’s see how the actual streaming read works.

var gridStore = new GridStore(db, fileId, "r");
gridStore.open(function(err, gridStore) {
  var stream = gridStore.stream(true);

  stream.on("data", function(chunk) {
    console.log("Chunk of file data");
  });

  stream.on("end", function() {
    console.log("EOF of file");
  });

  stream.on("close", function() {
    console.log("Finished reading the file");
  });
});

Right, let’s have a quick look at the streaming functionality supplied with the driver (make sure you are using 0.9.6-12 or higher, as there is a bug fix for custom chunk sizes that you need).

var stream = gridStore.stream(true);

This opens a stream to our file. You can pass in a boolean parameter to tell the driver to close the file automatically when it reaches the end, which fires the close event automatically. Otherwise you’ll have to handle cleanup when you receive the end event. Let’s have a look at the events supported.

  stream.on("data", function(chunk) {
    console.log("Chunk of file data");
  });

The data event is called for each chunk read, so the number of calls is determined by the chunk size the file was written with. If your file is 1MB big and was written with a chunkSize of 256K, you’ll get 4 calls to the data event handler. The chunk returned is a Buffer object.
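The arithmetic behind that number can be sketched in a couple of lines. The helper below is hypothetical (it is not a driver function) and assumes one data event per stored chunk, as described above.

```javascript
// Number of "data" events to expect when each stored chunk
// produces one event (hypothetical helper, not a driver function).
function expectedDataEvents(fileSize, chunkSize) {
  return Math.ceil(fileSize / chunkSize);
}

var oneMegabyte = 1024 * 1024;
console.log(expectedDataEvents(oneMegabyte, 1024 * 256));     // 4
console.log(expectedDataEvents(oneMegabyte + 1, 1024 * 256)); // 5, the last chunk holds a single byte
```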

  stream.on("end", function() {
    console.log("EOF of file");
  });

The end event is called when the driver reaches the end of data for the file.

  stream.on("close", function() {
    console.log("Finished reading the file");
  });

The close event is only called if you set the autoclose parameter on the gridStore.stream method as shown above. If it’s false or not set, handle cleanup of the stream in the end event handler.

Right, that’s it for writing to and reading from GridFS in an efficient manner. I’ll outline some other useful functions on the GridStore object.

Other useful methods on the GridStore object

There are some other methods that are useful

gridStore.writeFile(filename/filedescriptor, function(err, fileInfo) {});

writeFile takes either a file name or a file descriptor and writes it to GridFS. It does this in chunks to ensure the Eventloop is not tied up.

gridStore.read(length, function(err, data) {});

read/readBuffer lets you read the given length in bytes from the current position in the file.

gridStore.seek(position, seekLocation, function(err, gridStore) {});

seek lets you navigate the file to read from different positions inside the chunks. The seekLocation allows you to specify how to seek. It can be one of three values.

  • GridStore.IO_SEEK_SET Seek mode where the given length is absolute
  • GridStore.IO_SEEK_CUR Seek mode where the given length is an offset to the current read/write head
  • GridStore.IO_SEEK_END Seek mode where the given length is an offset to the end of the file
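As a rough illustration of what the three modes mean, the sketch below resolves a seek request to an absolute position. The constants and the resolveSeek helper are made up for this example and only mirror the descriptions above. Clamping the result to the file bounds is an assumption of the sketch, not documented driver behavior.

```javascript
// Illustrative constants and helper; the names are made up for this
// sketch and only mirror the three seek modes described above.
var SEEK_SET = 0, SEEK_CUR = 1, SEEK_END = 2;

function resolveSeek(position, mode, currentPosition, fileLength) {
  var target;
  if (mode === SEEK_CUR) {
    target = currentPosition + position; // relative to the read/write head
  } else if (mode === SEEK_END) {
    target = fileLength + position;      // relative to the end of the file
  } else {
    target = position;                   // absolute
  }
  // Assumption of this sketch: keep the result inside the file
  return Math.max(0, Math.min(target, fileLength));
}

console.log(resolveSeek(10, SEEK_SET, 50, 100));  // 10
console.log(resolveSeek(10, SEEK_CUR, 50, 100));  // 60
console.log(resolveSeek(-10, SEEK_END, 50, 100)); // 90
```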

    GridStore.list(dbInstance, collectionName, {id:true}, function(err, files) {})

list lists all the files in the collection in GridFS. If you have a lot of files the current version will not work very well, as it loads all the files into memory first. You can have it return either the filenames or the ids of the files using the options parameter ({id:true} above).

gridStore.unlink(function(err, result) {});

unlink deletes the file from Mongo DB, that’s to say all the file info and all the chunks.

This should be plenty to get you on your way building your first GridFS based application. As in the previous article the following links might be useful for you. Good luck and have fun.

Links and stuff

Node knockout tutorial 1, A Basic introduction to Mongo DB

A Basic introduction to Mongo DB

Mongo DB has rapidly grown to become a popular database for web applications and is a perfect fit for Node.JS applications, letting you write Javascript for the client, backend and database layer. Its schemaless nature is a better match for the constantly evolving data structures in web applications, and the integrated support for location queries is a bonus that’s hard to ignore. Throw in replica sets for scaling and we are looking at a really nice platform to grow your storage needs now and in the future.

Now to shamelessly plug my driver. It can be downloaded either using npm or fetched from the github repository. To install via npm do the following.

npm install mongodb

or go fetch it from github at https://github.com/christkv/node-mongodb-native

Once this business is taken care of, let’s move through the types available for the driver and then how to connect to your Mongo DB instance, before tackling some CRUD operations.

Mongo DB data types

So there is an important thing to keep in mind when working with Mongo DB, and that is that there is a slight mapping difference between the types supported in Mongo DB and the native types in Javascript. Let’s have a look at the types supported out of the box and then at how types are promoted by the driver to fit as closely to the native Javascript types as possible.

  • Float is an 8 byte value and is directly convertible to the Javascript type Number
  • Double class is a special class representing a float value. This is especially useful when using capped collections where you need to ensure your values are always floats.
  • Integers are a bit trickier due to the fact that Javascript represents all Numbers as 64 bit floats, meaning that integers are only exact up to 53 bits. Mongo has two types for integers, a 32 bit and a 64 bit. The driver will try to fit the value into 32 bits if it can and promote it to 64 bits if it has to. Similarly, when deserializing it will attempt to fit the value into 53 bits. If it cannot, it will return an instance of Long to avoid losing precision.
  • Long class is a special class that lets you store 64 bit integers and also lets you operate on them.
  • Date maps directly to a Javascript Date
  • RegExp maps directly to a Javascript RegExp
  • String maps directly to a Javascript String (encoded in utf8)
  • Binary class is a special class that lets you store binary data in Mongo DB
  • Code class is a special class that lets you store Javascript functions in Mongo DB, and can also provide a scope to run the function in
  • ObjectID class is a special class that holds a MongoDB document identifier (the equivalent of a primary key)
  • DbRef class is a special class that lets you include a reference in a document pointing to another object
  • Symbol class is a special class that lets you specify a symbol. It’s not really relevant for Javascript but is for languages that support the concept of symbols.

As we see, the number type can be a little tricky due to the way integers are implemented in Javascript. The latest driver will do correct conversion up to 53 bits of precision. If you need to handle bigger integers, the recommendation is to use the Long class to operate on the numbers.
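The 53 bit limit is easy to see in plain Javascript, without involving the driver at all. This small sketch only uses built-in Number functions.

```javascript
var limit = Math.pow(2, 53); // 9007199254740992

// Above 2^53 neighbouring integers collapse into the same Number
console.log(limit + 1 === limit);             // true

// Number.isSafeInteger tells you whether a value is still exact
console.log(Number.isSafeInteger(limit - 1)); // true
console.log(Number.isSafeInteger(limit));     // false

// Anything that fits in 32 bits is always safe
console.log(Number.isSafeInteger(2147483647)); // true
```

Values outside the safe range are the ones the text above recommends handling with the Long class.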

Getting that connection to the database

Let’s get around to setting up a connection with the Mongo DB database. Jumping straight into the code, let’s do a direct connection and then look at what it does.

var mongo = require('mongodb'),
  Server = mongo.Server,
  Db = mongo.Db;

var server = new Server('localhost', 27017, {auto_reconnect: true});
var db = new Db('exampleDb', server);

db.open(function(err, db) {
  if(!err) {
    console.log("We are connected");
  }
});

Let’s have a quick look at the simple connection. The new Server(…) call sets up the configuration for the connection, and auto_reconnect tells the driver to retry sending a command to the server if there is a failure. Another option you can set is poolSize, which lets you control how many TCP connections are opened in parallel. The default value is 1 but you can set it as high as you want. The driver will use a round-robin strategy to dispatch and read from the TCP connections.
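If round-robin is unfamiliar, the idea is simply to hand out connections in turn and wrap around at the end. The sketch below is a hypothetical stand-in to show the strategy. It is not the driver’s pool implementation, and the "connections" are just labels.

```javascript
// Hypothetical stand-in for a connection pool; the real driver
// manages sockets, here the "connections" are just labels.
function RoundRobinPool(connections) {
  this.connections = connections;
  this.next = 0;
}

RoundRobinPool.prototype.checkout = function() {
  var connection = this.connections[this.next];
  // Move on to the following connection, wrapping around at the end
  this.next = (this.next + 1) % this.connections.length;
  return connection;
};

var pool = new RoundRobinPool(["conn0", "conn1", "conn2"]); // think poolSize: 3
var used = [];
for (var i = 0; i < 5; i++) used.push(pool.checkout());
console.log(used.join(" ")); // conn0 conn1 conn2 conn0 conn1
```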

We are up and running with a connection to the database. Let’s move on and look at what collections are and how they work.

Mongo DB and Collections

Collections are the equivalent of tables in traditional databases and contain all your documents. A database can have many collections. So how do we go about defining and using collections? Well, there are a couple of methods that we can use. Let’s jump straight into the code and then look at what it does.

the requires and other initializing stuff omitted for brevity

db.open(function(err, db) {
  if(!err) {
    db.collection('test', function(err, collection) {});

    db.collection('test', {safe:true}, function(err, collection) {});

    db.createCollection('test', function(err, collection) {});

    db.createCollection('test', {safe:true}, function(err, collection) {});
  }
});  

Four different ways of getting a collection object, each slightly different in behavior. Let’s go through them and see what they do.

db.collection('test', function(err, collection) {});

This function will not actually create a collection on the database until you actually insert the first document.

db.collection('test', {safe:true}, function(err, collection) {});

Notice the {safe:true} option. This option will make the driver check if the collection exists and issue an error if it does not.

db.createCollection('test', function(err, collection) {});

This command will create the collection on the Mongo DB database before returning the collection object. If the collection already exists it will ignore the creation of the collection.

db.createCollection('test', {safe:true}, function(err, collection) {});

The {safe:true} option will make the method return an error if the collection already exists.

With an open db connection and a collection defined we are ready to do some CRUD operations on the data.

And then there was CRUD

So let’s get dirty with the basic operations for Mongo DB. The Mongo DB wire protocol is built around 4 main operations insert/update/remove/query. Most operations on the database are actually queries with special json objects defining the operation on the database. But I’m getting ahead of myself. Let’s go back and look at insert first and do it with some code.

the requires and other initializing stuff omitted for brevity

db.open(function(err, db) {
  if(!err) {
    db.collection('test', function(err, collection) {
      var doc1 = {'hello':'doc1'};
      var doc2 = {'hello':'doc2'};
      var lotsOfDocs = [{'hello':'doc3'}, {'hello':'doc4'}];

      collection.insert(doc1);

      collection.insert(doc2, {safe:true}, function(err, result) {});

      collection.insert(lotsOfDocs, {safe:true}, function(err, result) {});
    });
  }
});

A couple of variations on the theme of inserting a document, as we can see. To understand why they exist, it’s important to understand how Mongo DB works during inserts of documents.

Mongo DB has asynchronous insert/update/remove operations. This means that when you issue an insert operation it’s a fire and forget operation where the database does not reply with the status of the insert. To retrieve the status of the operation you have to issue a query to retrieve the last error status of the connection. To make it simpler for the developer, the driver implements the {safe:true} option so that this is done automatically when inserting the document. {safe:true} becomes especially important when you do an update or remove, as otherwise it’s not possible to determine the number of documents modified or removed.

Now let’s go through the different types of inserts shown in the code above.

collection.insert(doc1);

Taking advantage of the async behavior and not needing confirmation that the data was persisted to Mongo DB, we just fire off the insert (we are doing live analytics, so losing a couple of records does not matter).

collection.insert(doc2, {safe:true}, function(err, result) {});

That document needs to stick. Using the {safe:true} option ensures you get the error back if the document fails to insert correctly.

collection.insert(lotsOfDocs, {safe:true}, function(err, result) {});

A batch insert of documents with any errors being reported. This is much more efficient if you need to insert large batches of documents, as you incur a lot less overhead.

Right, that’s the basics of inserts ironed out. We’ve got some documents in there but want to update them, as we need to change the content of a field. Let’s have a look at a simple example and then we will dive into how Mongo DB updates work and how to do them efficiently.

the requires and other initializing stuff omitted for brevity

db.open(function(err, db) {
  if(!err) {
    db.collection('test', function(err, collection) {
      var doc = {mykey:1, fieldtoupdate:1};

      collection.insert(doc, {safe:true}, function(err, result) {
        collection.update({mykey:1}, {$set:{fieldtoupdate:2}}, {safe:true}, function(err, result) {});      
      });

      var doc2 = {mykey:2, docs:[{doc1:1}]};

      collection.insert(doc2, {safe:true}, function(err, result) {
        collection.update({mykey:2}, {$push:{docs:{doc2:1}}}, {safe:true}, function(err, result) {});
      });
    });
  };
});

Alright, before we look at the code we want to understand how document updates work and how to do them efficiently. The most basic and least efficient way is to replace the whole document, which is not really the way to go if you want to change just one field in your document. Luckily Mongo DB provides a whole set of operations that let you modify just pieces of the document (see the Atomic operations documentation). They are outlined below.

  • $inc – increment a particular value by a certain amount
  • $set – set a particular value
  • $unset – delete a particular field (v1.3+)
  • $push – append a value to an array
  • $pushAll – append several values to an array
  • $addToSet – adds a value to the array only if it’s not in the array already
  • $pop – removes the last element in an array
  • $pull – remove a value(s) from an existing array
  • $pullAll – remove several value(s) from an existing array
  • $rename – renames the field
  • $bit – bitwise operations
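To get a feel for what a few of these operators do to a document, here is a small in-memory imitation of $set, $inc and $push. The applyUpdate function is hypothetical and only handles top-level fields. It is a sketch of the semantics, not how the server or the driver implements updates.

```javascript
// Hypothetical in-memory imitation of three update operators.
// It only handles top-level fields and is not how the server works.
function applyUpdate(doc, update) {
  var field;
  // $set: overwrite (or create) the field with the given value
  for (field in update.$set || {}) doc[field] = update.$set[field];
  // $inc: add the given amount, treating a missing field as 0
  for (field in update.$inc || {}) doc[field] = (doc[field] || 0) + update.$inc[field];
  // $push: append the value, creating the array if the field is missing
  for (field in update.$push || {}) {
    if (doc[field] === undefined) doc[field] = [];
    doc[field].push(update.$push[field]);
  }
  return doc;
}

var doc = {mykey: 2, counter: 1, docs: [{doc1: 1}]};
applyUpdate(doc, {$set: {fieldtoupdate: 2}, $inc: {counter: 5}, $push: {docs: {doc2: 1}}});
console.log(JSON.stringify(doc));
// {"mykey":2,"counter":6,"docs":[{"doc1":1},{"doc2":1}],"fieldtoupdate":2}
```

The point of the real operators is that this kind of change happens on the server in one step, so you never have to send the whole document back and forth.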

Now that the operations are outlined, let’s dig into the specific cases shown in the code example.

collection.update({mykey:1}, {$set:{fieldtoupdate:2}}, {safe:true}, function(err, result) {});

Right so this update will look for the document that has a field mykey equal to 1 and apply an update to the field fieldtoupdate setting the value to 2. Since we are using the {safe:true} option the result parameter in the callback will return the value 1 indicating that 1 document was modified by the update statement.

collection.update({mykey:2}, {$push:{docs:{doc2:1}}}, {safe:true}, function(err, result) {});

This update adds another document to the field docs in the document identified by {mykey:2} using the atomic operation $push. This allows you to keep structures such as queues in Mongo DB.

Let’s have a look at the remove operation for the driver. As before let’s start with a piece of code.

the requires and other initializing stuff omitted for brevity

db.open(function(err, db) {
  if(!err) {
    db.collection('test', function(err, collection) {
      var docs = [{mykey:1}, {mykey:2}, {mykey:3}];

      collection.insert(docs, {safe:true}, function(err, result) {

        collection.remove({mykey:1});

        collection.remove({mykey:2}, {safe:true}, function(err, result) {});

        collection.remove();
      });
    });
  };
});

Let’s examine the 3 remove variants and what they do.

collection.remove({mykey:1});

This leverages the fact that Mongo DB is asynchronous and that it does not return a result for insert/update/remove to allow for synchronous style execution. This particular remove query will remove the document where mykey equals 1.

collection.remove({mykey:2}, {safe:true}, function(err, result) {});

This remove statement removes the document where mykey equals 2, but since we are using {safe:true} it will go back to Mongo DB to get the status of the remove operation and return the number of documents removed in the result variable.

collection.remove();

This last one will remove all documents in the collection.

Time to Query

Queries are of course a fundamental part of interacting with a database and Mongo DB is no exception. Fortunately for us it has a rich query interface with cursors and concepts close to SQL for slicing and dicing your datasets. To build queries we have lots of operators to choose from (see the Mongo DB advanced queries documentation). There are literally tons of ways to search and ways to limit the query. Let’s look at some simple code for dealing with queries in different ways.

the requires and other initializing stuff omitted for brevity

db.open(function(err, db) {
  if(!err) {
    db.collection('test', function(err, collection) {
      var docs = [{mykey:1}, {mykey:2}, {mykey:3}];

      collection.insert(docs, {safe:true}, function(err, result) {

        collection.find().toArray(function(err, items) {});

        var stream = collection.find({mykey:{$ne:2}}).streamRecords();
        stream.on("data", function(item) {});
        stream.on("end", function() {});

        collection.findOne({mykey:1}, function(err, item) {});

      });
    });
  };
});

Before we start picking apart the code there is one thing that needs to be understood: the find method does not execute the actual query. It builds an instance of Cursor that you then use to retrieve the data. This lets you control how you retrieve the data from Mongo DB, and the Cursor keeps track of the state of your query on Mongo DB. Now let’s pick apart the queries we have here and look at what they do.

collection.find().toArray(function(err, items) {});

This query will fetch all the documents in the collection and return them as an array of items. Be careful with the toArray function, as it instantiates all the documents in memory before returning the final array of items. If you have a big result set you could run into memory issues.

var stream = collection.find({mykey:{$ne:2}}).streamRecords();
stream.on("data", function(item) {});
stream.on("end", function() {});

This is the preferred way if you have to retrieve a lot of data. As each document is deserialized a data event is emitted, which keeps the resident memory usage low because the documents are streamed to you one at a time. This is very useful if you are pushing documents out via websockets or some other streaming socket protocol. Once there are no more documents the driver will emit the end event to notify the application that it’s done.
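The event pattern itself can be tried out without a database. The sketch below uses Node’s built-in EventEmitter as a hypothetical stand-in for the record stream. fakeRecordStream and its start method are made up for this example and are not part of the driver.

```javascript
var EventEmitter = require("events").EventEmitter;

// Hypothetical stand-in for a cursor stream: emits one "data" event
// per document and a final "end" event. No database is involved.
function fakeRecordStream(docs) {
  var stream = new EventEmitter();
  stream.start = function() {
    docs.forEach(function(doc) { stream.emit("data", doc); });
    stream.emit("end");
  };
  return stream;
}

var seen = [];
var finished = false;
var stream = fakeRecordStream([{mykey: 1}, {mykey: 3}]);

stream.on("data", function(item) {
  // Handle one document at a time instead of collecting them all
  seen.push(item.mykey);
});
stream.on("end", function() {
  finished = true;
});

stream.start();
console.log(seen, finished); // [ 1, 3 ] true
```

The handlers have the same shape as the ones attached to the driver’s stream above, so the processing code would carry over unchanged.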

collection.findOne({mykey:1}, function(err, item) {});

This is a specially supported function that retrieves just one specific document, bypassing the need for a cursor object.

That’s pretty much it for the quick intro on how to use the database. I have also included a list of links on where to go to find more information, as well as a crude sample location application I wrote using Express JS and Mongo DB.

Links and stuff

Ad people bullshitting (Taken with Instagram at Fira de Barcelona)

Dinner for speakers (Taken with Instagram at Las caballerizas)

Wild boar (Taken with instagram)

Offsite xing (Taken with Instagram at Denmark)
