MongoDB Querying Tips & Tricks

| Comments

This post contains a bunch of tricks I picked up over the years working with MongoDB. They are also implemented in MQL as helper functions.

Filter on creation time

Did you know you can filter on creation time without adding a timestamp field?

By default, a document’s _id field contains an ObjectId. This ObjectId is composed from several values: number of seconds from the epoch, machine identifier, process identifier and a counter.

Since the ObjectId starts with a timestamp you can order by it or use it for filtering. The following function generates an ObjectId from a python datetime object:

1
2
3
4
5
6
7
8
import datetime, time, bson

def oid_from_date(dt):
    seconds_from_epoch = int(time.mktime(dt.timetuple()))
    return bson.ObjectId(hex(seconds_from_epoch)[2:] + '0' * 16)

>>> oid_from_date(datetime.datetime(2015, 7, 4, 12, 0, 0))
ObjectId('5597a0900000000000000000')

We can now use this for querying documents in a specific range:

1
>>> connection.test.test.find({_id: {'$gt': oid_from_date(datetime.datetime(2015, 7, 3)), '$lt': oid_from_date(datetime.datetime(2015, 7, 4))}})

Here’s how you would do it with MQL:

1
select * from test.test where _id between oidFromDate('2015-07-03') and oidFromDate('2015-07-04')

Convert datetime to date

It’s often useful to aggregate data in different time resolutions: yearly, monthly, daily, etc’.

MongoDB lets us group by several fields but having a seperate field for year, month and day isn’t natural.

As of MongoDB 3.0 there’s a builtin function called dateToString that does exactly that. The following solution is useful if you’re still on 2.6.

In order to generate a date from a datetime we can construct a number that contains the year, month and day. For example, the datetime 2015-07-04 12:00:00 can be represented as a date using a number: 20150704. Here’s how you would do it with MQL:

1
select year(timestamp) * 10000 + month(timestamp) * 100 + dayOfMonth(timestamp) as date from test.test

There’s a builtin helper function that generates the same expression:

1
select extractDate(timestamp) as date from test.test

Here’s how you can do it with MongoDB’s aggregation framework:

1
2
3
4
5
6
7
connection.test.test.aggregate([
  {'$project':
    {'date':
      {'$add': [{'$multiply': [{'$year': '$timestamp'}, 10000]},
                {'$multiply': [{'$month': '$timestamp'}, 100]},
                {'$dayOfMonth': '$timestamp']}}}
])

You can use the same concept for aggregating on other resolutions using additional functions like $hour and $minute.

Floating point manipulation

floor has been introduced in MongoDB 3.2 but since it’s not out yet, you can do the following:

1
2
3
4
connection.test.test.aggregate([
  {'$project':
    {'number': {'$subtract': ['$number', {'$mod': ['$number', 1]}]}}}
])

Here’s how you would do it with MQL:

1
select number - number % 1 from test.test

There’s a builtin helper function that generates the same expression:

1
select toInt(number) from test.test

Compare fields

MongoDB’s find() doesn’t support comparing fields but the aggregation framework gives us the building blocks to implement it:

1
2
3
4
5
6
connection.test.test.aggregate([
  {'$project':
    {'a = b': {'$eq': ['$a', '$b']}}},
  {'$match':
    {'a = b': True}}
])

Here’s how you would do it with MQL:

1
select a from test.test where a=b

Argument Binding in Python

| Comments

After a recent debate about argument binding I decided to list some pros and cons of the different methods to bind arguments in python. Let’s start with a review of available methods:

argument binding
1
2
3
4
5
6
7
8
9
def add(x, y):
    return x + y

from functools import partial
add5_partial = partial(add, 5)
add5_partial(10) # 15

add5_lambda = lambda x: add(x, 5)
add5_lambda(10) # 15

My beef with partial

partial is not a function and misbehaves where functions are expected. partial can be easily implemented in pure python but I can only guess it was implemented in C for performance considerations. Lets review some examples:

1. Partial doesn’t work on methods:

partial on methods
1
2
3
4
5
6
7
8
9
10
11
12
from functools import partial

class Cell(object):
    def set_state(self, state):
        self._state = state
    set_alive = partial(set_state, state=True)
    set_dead = partial(set_state, state=False)

>>> Cell().set_alive()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: set_state() missing 1 required positional argument: *self*

Why is that?

You know how self always gets assigned the instance when calling an instance method? Well, it’s implemented using a mechanism called descriptors. In order to support that functionality the function type implements a __get__ method.

Here’s how a method call works:

methods explained
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
class Person(object):
    def __init__(self, name):
        self._name = name
    def speak(self):
        print 'My name is', self._name

>>> p = Person('Neo')
>>> p.speak() # a method call
My name is Neo

>>> # so what's the difference between a method and a function?
>>> method = p.speak # the method encaspulates the function and the instance
>>> method
<bound method Person.speak of <__main__.Person object at 0x109d7bb90>>
>>> method.im_self # that's where self is hiding!
<__main__.Person object at 0x109d7bb90>>
>>> method.im_func # that's where the function is hiding!
<function Person.speak at 0x106163950>
>>> method() # same as: method.im_func(method.im_self)
My name is Neo

>>> # what transforms a function into a method?
>>> Person().speak # triggers __getattribute__('speak') 
>>> # __getattribute__ looks for the attribute in the instance's __dict__
>>> # __getattribute__ then looks for the attribute in the class's __dict__
>>> # after it finds it, it checks if the value (a function) implements a __get__ method
>>> # if it doesn't implement __get__, return the value
>>> # if it does, return whatever __get__ returns
>>> method = Person.speak.__get__(Person('Neo'))
>>> method
<bound method Person.speak of <__main__.Person object at 0x109d7bb90>>

2. partial cannot be inspected:

inspect and partial
1
2
3
4
5
6
7
8
9
10
11
12
13
>>> import inspect, functools
>>> p = functools.partial(lambda x, y: x + y, 10)
>>> inspect.getargspec(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/inspect.py", line 823, in getargspec
    getfullargspec(func)
  File "/usr/lib/python3.3/inspect.py", line 850, in getfullargspec
    raise TypeError('{!r} is not a Python function'.format(func))
TypeError: functools.partial(<function <lambda> at 0x1073065f0>, 10) is not a Python function

>>> print p.__doc__ # doesn't preserve the wrapped function's __doc__
partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.

3. partial could be safer, validating argument counts and names:

argument validation
1
2
3
4
5
6
>>> from functools import partial
>>> f = partial(lambda: None, 1, 2, 3) # why not verify signature here?!
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: <lambda>() takes no arguments (3 given)

Alternatives to partial

You could implement your own partial, that actually returns a function:

an alternative partial
1
2
3
4
5
6
7
from functools import wraps

def partial(func, *a, **k):
    @wraps(func)
    def new_func(*args, **kwargs):
        return func(*(a + args), **dict(k, **kwargs))
    return new_func

Note: don’t use this code, it doesn’t enforce unique keyword arguments.

You could also use a lambda function:

lambda as partial method
1
2
3
4
5
class Cell(object):
     def set_state(self, state):
         self._state = state
     set_alive = lambda self: self.set_state(True)
     set_dead = lambda self: self.set_state(False)

My problem with lambdas

As my friend @EyalIL wrote:

Lambdas capture the variable, partial captures the value.

The latter is much more often useful.

Here’s an example to make that sentence clearer:

lambda bug
1
2
3
4
5
callbacks = []
for i in xrange(5):
    callbacks.append(lambda: i)
>>> print [callback() for callback in callbacks]
[4, 4, 4, 4, 4]

Why is this happening?

Because python supports closures (which are usually a good thing):

closures explained
1
2
3
4
5
6
7
8
9
var = 1
f = lambda: var
print f() # 1
var = 2
print f() # 2
# but, but, how does python know?! well, a function can hold references to variables in outer scopes
print f.func_closure() # (<cell at 0x101bdfb40: int object at 0x7fd3e9c106d8>,)
# what are these cells? the cell is basically a pointer to some name in some scope. holding a reference promises changes to be reflected even when changing immutable datatypes.
print f.func_closure()[0].cell_contents # 2

A solution would be to bind variables who aren’t function arguments as function arguments:

lambdas with default args
1
2
3
4
5
6
callbacks = []
for i in xrange(5):
    callbacks.append(lambda x=i: x)
>>> print [callback() for callback in callbacks]
[0, 1, 2, 3, 4]

Can we do better?

I’d like to propose a mechanism similar to Javascript’s Function.bind functionality.

Here’s how I’d like it to behave (this is a proposal, the code doesn’t actually work):

bind examples
1
2
3
4
5
6
7
8
9
10
11
12
13
def add(x, y):
    return x + y

from functools import partial
add5_partial = partial(add, 5) # requires an import
add5_lambda = lambda x: add(x, 5) # pretty long

add5_bind = add.bind(5) # shortest
import inspect
>>> print inspect.getargspec(add)
ArgSpec(args=['x', 'y'], varargs=None, keywords=None, defaults=None)
>>> print inspect.getargspec(add5_bind) # works with inspect
ArgSpec(args=['y'], varargs=None, keywords=None, defaults=None)

If you dig bind, please comment/vote. Given enough feedback I’ll be motivated to write a PEP.