Fetching Records with Batches
There are some situations in which you must fetch a lot of records from the database, instantiate each model object and show them in your view.
There are many real-life situations where this can happen. I’ve had to deal with one recently as I was developing a real-estate website and had to list all districts, counties and parishes from Portugal in a search form. It’s a lot of coupled data which could be - and was - slowing my application down.
Sometimes things can get messy as they take a lot of time and/or eat up a lot of memory. There are three main downsides to this:
- The requesting user will wait some time until everything is loaded
- The operation will consume a lot of memory
- It may affect other users using your application, since less memory and CPU will be available for what they’re doing
We can avoid all these downsides by fetching data in batches. This way, we’ll go easier on memory since Rails won’t load the whole table(s) into memory. For more on Ruby’s issues in memory usage, please refer to my articles about Ruby’s garbage collector.
Until now we had a few solutions, based on plugins like active_record_each or “hacks” like faking-cursos-in-activerecord.
Rails 2.3 changed all this, since it natively implements two distinct methods regarding this feature:
We’ll have a look at their internals and benchmark them.
For comparison, we’ll pick up my old extreme blog example, which now has 30,081 comment entries. Let’s render them all, watch memory usage and how much time it takes using the 3 methods - the ones I stated above and a normal find.
Starting with the normal find, we have:
# comments controller @comments = Comment.find(:all, :include => :post) # comments view <% @comments.each do |comment| %> <%=h comment.post %> <%=h comment.content %> <% end %>
For all 211 seconds taken to process the request, the CPU usage was always between 80~100% and memory usage of our mongrel process around 158MB. The memory growth was quite linear, gaining ~1MB each second.
find_each
This method only retrieves a given number of records at a time, letting you iterate over the records as if it was a cursor. This way you’ll go easier on memory, making things faster and avoiding the badass ugly garbage collector.
Model.find_each(:conditions => ["coolness > ?", 0]) do |my_model| my_model.do_something! end
Pretty easy, just like a the normal find method.
There’s an extra option called :batch_size option which defaults to 1000 records. that represents the amount of records Rails will fetch in each DB query. Feel free to change it and do your own benchmarks, the best value for the batch highly depends on how much and what you’re fetching.
Be aware that you cannot use the :limit option with this method since :limit is used internally for fetching one batch at a time. The :order options will also be discarded, but this issue will probably be fixed in a near future. If you really need to order your find_each calls, you should have a look at the method grosser has developed.
Let’s benchmark it again, changing the previous code to:
# comments controller # we won't call this here since it'll dynamically give us the records to iterate (fetching 1000 at a time) # @comments = Comment.find_each(:all, :include => :post) #error # comments view <% Comment.find_each(:include => :post) do |comment| %> <%=h comment.post %> <%=h comment.content %> <% end %>
It took a bit less time, around 195 seconds, with the same amount of CPU usage. The main difference is in memory usage: 118MB. Also, the memory growth was much slower through the whole process, chewing a bit extra memory right in the end.
Summing up:
- Less 10% of time taken
- Less 25% of memory used
- Memory growth was much more subtle, only growing significantly by the end of the request processing
find_in_batches
This method is quite similar to find_each, except it gives you control back everytime it fetches a batch. Please note that you cannot specify an order, something explained in the apidock find_in_batches page:
It’s not possible to set the order. That is automatically set to ascending on the primary key (“id ASC”) to make the batch ordering work. This also mean that this method only works with integer-based primary keys. You can’t set the limit either, that’s used to control the the batch sizes.
Similarly to find_each as well is the :batch_size option which can be specified and defaults to 1000. Another important note is that both find_each and find_in_batches support named scopes.
Heading on to benchmarking, we changed our code to:
# comments controller
# we won't call this here since many 1000 blocks will be returned
# @comments = Comment.find_each(:all, :include => :post) #error
# comments view
<% Comment.find_in_batches(:include => :post) do |batched_comments| %>
<% batched_comments.each do |comment| %>
<%=h comment.post %>
<%=h comment.content %>
<% end %>
<% end %>
Note that we need to act on each batch of comments, instead of expecting a stream of individual records like those find_each returns.
As of benchmarking, it took another 192 seconds like find_each, with the same amount of CPU usage of our previous tests. It was a tiny bit heavier on memory usage, with 121MB, but that “slow increase” I mentioned with find_each is even more prominent on this one, leaving more memory available for remaining processes.
On a final note, remember to use approaches like this when you have to deal with thousands of records (or more). If you’re dealing with less records then both find_each and find_in_batches are pretty useless.