At the Forge

Use Your Database!

Reuven M. Lerner

Issue #256, August 2015

Objects are great, but your database is the best place to do serious manipulations on your data.

I love high-level, dynamically typed languages, such as Python, Ruby and JavaScript. They're easy—and even fun—to use. They let me express myself richly, and they lend themselves to code that easily can be reused and maintained. It's no surprise that interest in such languages is on the rise, especially when creating Web applications.

Now, one of the downsides of these languages is that they tend to execute more slowly than static languages, such as Java, C# and Go. But for a very large number of Web applications, this speed difference doesn't matter, or it is justified by the productivity gain enjoyed by the engineers, or it can be (somewhat) handled by throwing hardware at the problem.

However, the fact that dynamic languages run more slowly than their static counterparts doesn't mean you want to ignore the speed issue completely. Once you know your way around dynamic languages, as well the frameworks built in them, you get a sense of what runs quickly and what doesn't.

In just the past few weeks, however, I've encountered a pattern—or perhaps I should say, an “anti-pattern”—in the code that several of my consulting clients had written. This anti-pattern is well known to experienced developers, but it seems to be less well known than I would have hoped or expected. That anti-pattern, stated simply, is that you should have the database do as much work as possible.

There are several reasons for throwing as much as possible at your database server. First and foremost, your database almost certainly is written in C, so it's likely to execute more quickly than your high-level, dynamic code.

Second, your database has been highly optimized through the years, such that retrieving data from it has been tuned to take into account memory, disk and the frequency of the retrievals.

Third, although network bandwidth is cheap nowadays, it's not infinitely fast. This means that although you could, in theory, write a database query in Ruby that returns a large number of rows and then filters through them using Enumerate#map, if you have the database do some of this for you, it can reduce the amount of data you're retrieving dramatically and, thus, lead to faster application responses and less network usage.

So in this article, I explore this anti-pattern of doing work in an application that probably should be done in the database. You'll see how you can get the same results, but much faster, by applying this rule. There's obviously no one right way to do things, but having the database do as much work as possible is likely to make your applications faster and easier to maintain.

Don't Load Everything

High-level languages, and most high-level Web frameworks, don't encourage you to write SQL directly. Rather, you use objects and methods to work with the database; the methods you invoke are translated into SQL by an ORM (object-relational mapper). Part of the reason, I believe, for the widespread inefficiencies in people's database queries is that they don't see the SQL they're writing, so they don't understand some of the implications of their method calls.

For example, say I'm working on a project in Django. If I have a model named Person, I can (and should) invoke the “objects” method in order to work with the corresponding table in the database. I then can take the resulting object and apply additional filters, getting (for example) the records corresponding to people who are system administrators:

>>> admins = Person.objects.filter(admin=True).all()

Once I've done that, “admins” will contain a set of records, known in the Django world as a “QuerySet”. But in actuality, a QuerySet doesn't contain the records themselves. Rather, it serves as a go between to the database. If you iterate over the QuerySet, you'll get each of the records, one by one.

Thus, even if you'll eventually get one million records back from the database, the above code doesn't retrieve them. You can get the records, one by one, by iterating over the result set. For example, the following will display the user names for all of the administrators:

>>> for admin in admins:
        print(admin.username)

This is the right way to work with objects in Django. Although it might seem weird not to have the entire result set in memory, the implications are tremendous. You don't need to worry about using up all of the server's memory if the resulting records will be too large.

Working with iterators is easy and straightforward, if you're used to it. If you're not, it might seem strange not to have the entire result set at once, and to iterate over it. Moreover, all you need is the right combination of a result set and the following code:

>>> admins = list(Person.objects.filter(admin=True).all())

Notice what I've changed in the above assignment? I'm no longer asking for the QuerySet, over which I can iterate. Rather, I've asked for the QuerySet's data to be used to create a list and then assigned to “admins”. If you have one million records in your result set, this is going to consume a fairly large amount of memory.

It's true that this can be necessary, at times, but those times are fairly rare. After all, the odds are pretty good that you're retrieving the records in order to display them to users, something that is easily and efficiently accomplished with iteration.

Filtering

Let's say I'm interested in displaying all of the administrators on my system. Above, I showed that you can do that with:

>>> admins = Person.objects.filter(admin=True).all()
>>> for admin in admins:
        print(admin.username)

However, there's a variation of this that I've often seen people do:

>>> people = Person.objects.all()
>>> for person in people:
        if person.admin:
            print(person.username)

Notice what I'm doing here: I'm retrieving all of the objects and then iterating over them. Then, I use an “if” statement in Python to determine whether I want to print the user name. If you're used to working with Python objects, this seems like a perfectly natural thing to do.

However, let's consider what's actually happening here. You're retrieving all of the records and using only a small number of them. This means the database is being forced to read through all of its records, bring all of them into memory and send those records to the Python application—even though the odds are that only a small proportion of these records will be printed.

Moreover, while the “if” statement in Python definitely is quite efficient, there is still some overhead to the lookup of the person.admin attribute, not to mention the creation of a new “Person” object for each record you got back from the database. In other words, you're creating a huge number of Person objects just to display some output.

It's far, far more efficient to do your filtering in the database and create Python objects only for the records that you're most likely to want to display. The database, if defined correctly, has indexes that it can use to speed up the query if you tell it to filter records such that it consumes less memory, less CPU and less network bandwidth.

I've seen a variation on this anti-pattern in that people sometimes want to perform transformations on data that they have retrieved from the database. For example, let's assume that I want to apply a sales tax of 10% on all of the prices in a set of records. I certainly could say:

>>> products = Product.objects.all()
>>> for product in products:
        print(product.price * 1.10)

But it'll be faster and require less Python code, if I simply say:

>>> products = Product.objects.raw('select id, price * 1.10 as
>>> price_with_tax from store_product))
>>> for product in products:
        print(product.price_with_tax)

Notice how the use of raw allows you to go behind the back of Django's ORM, using whatever SQL you want. Is this the way you always want to do things? Surely not. But in specific cases, or when you want to use a function, it definitely can come in handy. Note that the object you get back from the call to raw() is a RawQuerySet, which is an iterator just like the regular QuerySet. However, it lacks an all() method, which is just as well, given that the RawQuerySet is already an iterator, giving the appropriate records when requested (and not before).

Note that for commonly used SQL functions, such as COUNT, there are built-in Django methods that handle such things. So if you're counting, sorting or grouping, you shouldn't need to step down to the SQL level. And as a general rule, you don't want to do that. However, there are times when it comes in handy—particularly if you're trying to reduce the amount of data you'll have to handle in Python.

Loops and Joins

The final anti-pattern is something I just saw at a client's office several days before writing this. The company has a large number of products and wants to perform a query for each of the products. So, they did something like this:

>>> products = Product.objects.all()
>>> for product in products:
        ProductInfo.objects.filter(product_id=product.id).all()

This query took a very long time to run. Why? Because for each of the thousands of products, they were then issuing an additional SQL query. The funny thing was that each individual query executed quickly, so it didn't show up in our PostgreSQL slow-query logging monitor. But the effect of executing such a query was dramatic and ended up taking many minutes.

The solution was to turn our many queries into a single query. In SQL, we would use an inner join. And indeed, when I used an inner join in raw SQL, we found that instead of taking several minutes to execute, it took 1.5 seconds—obviously, a huge time savings.

There are two possible solutions in Django for this problem. The first is to use a raw SQL query, as I showed above. That's not an ideal solution, particularly since the whole idea of an ORM is to remove the use of SQL and stay within a single language (Python, in this case). But there are times when you cannot avoid it.

However, if you want to be smarter about it, you can use Django's selected_related method. This allows you to retrieve not just one model, but a related model—in effect, creating a join in your database and producing one large query instead of many small ones. The effect on the performance of your application may well be dramatic in such a case, as I discovered when working with my client.

Conclusion

Object-relational mappers are wonderful things. However, at the end of the day, sometimes they can fool you into forgetting that there is a cost (in time and space) to bringing your data from the database into your language. Most modern frameworks try to help by using lazy-loading and iterators, such that you retrieve individual records and not the entire data set. However, it's all too easy to retrieve everything at once, or make your application work too hard, or even to invoke too many queries on your database.

Resources

The Django documentation is at https://docs.djangoproject.com. Look for the QuerySet documentation to see more about this subject.

If you're using Ruby on Rails, you should look at the documentation for ActiveRecord at rubyonrails.com. In particular, see the “lazy-loading” features that are now standard in ActiveRecord.

Finally, Pat Shaughnessy wrote a fantastic blog post on this subject, looking at Ruby on Rails and PostgreSQL. It's worth reading even if you don't use these specific technologies to understand the implications of bringing data out of the database. His piece is at patshaughnessy.net/2015/6/18/dont-let-your-data-out-of-the-database.