Avoid Memory Issues with Django’s bulk_create

When inserting a large number of objects into the database with Django, your first thought should be to use bulk_create. It is much more efficient than calling create for each object, and it typically results in a single query. However, when dealing with tens of thousands, hundreds of thousands, or even more objects, you may run into out-of-memory errors.
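
For context, here is a minimal sketch of the difference, using the hypothetical MyModel, its field1 field, and the data iterable from the examples below:

# Per-object create: one INSERT (and one round trip) per row.
for row in data:
    MyModel.objects.create(field1=row['field1'])

# bulk_create: a single INSERT for all rows.
MyModel.objects.bulk_create([MyModel(field1=row['field1']) for row in data])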

Problem

Consider the following code:

from my_app.models import MyModel


def create_data(data):
    objs = []
    for row in data:
        obj = MyModel(field1=row['field1'])
        objs.append(obj)

    MyModel.objects.bulk_create(objs)

This code works and, for most cases, is efficient from a database perspective. However, as data grows, the objs list can consume enough memory to cause out-of-memory errors.

Using Generators

This is a solved problem in Python: use a generator. A generator function uses yield to produce an iterator whose values are generated lazily, one at a time, as iteration occurs, rather than being built up in memory all at once. We can use this to our advantage:

from my_app.models import MyModel


def create_data(data):
    MyModel.objects.bulk_create(generator(data))


def generator(data):
    for row in data:
        yield MyModel(field1=row['field1'])

This would work in theory. Unfortunately, at the time of writing, Django’s bulk_create converts its iterable argument into a list, which fully evaluates the generator and defeats the purpose of using one.
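
Conceptually, the method starts out roughly like this (a simplified paraphrase, not Django’s actual source), so whatever you pass in ends up as a full list in memory before any query runs:

def bulk_create(self, objs, batch_size=None):
    # Simplified sketch: the iterable is materialized immediately,
    # so passing a generator still builds every object in memory.
    objs = list(objs)
    ...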

Workaround

Since bulk_create converts the generator into a list, we need to split the data into batches ourselves and pass them in one at a time.

from itertools import islice

from my_app.models import MyModel


def create_data(data):
    bulk_create(MyModel, generator(data))


def bulk_create(model, generator, batch_size=10000):
    """
    Uses islice to call bulk_create on batches of
    Model objects from a generator.
    """
    while True:
        items = list(islice(generator, batch_size))
        if not items:
            break
        model.objects.bulk_create(items)


def generator(data):
    for row in data:
        yield MyModel(field1=row['field1'])

Here, we use islice to take slices of the generator (without fully evaluating it) and pass each batch to bulk_create. This way, at most batch_size objects are held in memory at a time. The trade-off is more database queries, so pick a batch size that is large enough to be performant but small enough to avoid out-of-memory errors.
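
As a usage sketch (the CSV file and its field1 column are hypothetical), the data argument itself can also be a lazy iterable such as a csv.DictReader, so neither the raw rows nor the model instances are ever held in memory all at once:

import csv

from my_app.models import MyModel


def import_csv(path):
    with open(path, newline='') as f:
        # DictReader yields rows lazily, one line at a time.
        reader = csv.DictReader(f)
        bulk_create(MyModel, generator(reader), batch_size=10000)

Because islice pulls only batch_size objects through the pipeline at a time, memory usage stays roughly constant no matter how large the file is.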

Closing Thoughts

For more Django optimization tips, check out Django ORM Optimization Tips.
