benchling/sqlalchemy_batch_inserts

Benchling uses SQLAlchemy and psycopg2 to talk to PostgreSQL. To save on round-trip latency, we batch our inserts using this code.

Typically, creating and flushing N models in SQLAlchemy takes N roundtrips to the database if the model has an autoincrementing primary key. This module reduces that to 2 roundtrips, without requiring any other changes to your code.
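For intuition, here is a minimal sketch of the underlying idea in raw psycopg2, assuming a hypothetical widgets table backed by a widgets_id_seq sequence. This is an illustration of the approach, not this library's internals:

import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("postgresql://postgres@localhost")  # hypothetical DSN

names = ["a", "b", "c"]
with conn, conn.cursor() as cur:
    # Roundtrip 1: prefetch one sequence value per pending row.
    cur.execute(
        "SELECT nextval('widgets_id_seq') FROM generate_series(1, %s)", (len(names),)
    )
    ids = [row[0] for row in cur.fetchall()]
    # Roundtrip 2: send all the INSERTs together with execute_batch.
    execute_batch(
        cur, "INSERT INTO widgets (id, name) VALUES (%s, %s)", list(zip(ids, names))
    )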

Here's our accompanying blog post that describes how we investigated and improved insert performance: sqlalchemy_batch_inserts: a module for when you’re inserting thousands of rows and it’s slow

Is this for me?

You may find use for this module if:

  • You are using SQLAlchemy
  • You are using Postgres
  • You sometimes need to create several models at once and care about performance

Usage

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy_batch_inserts import enable_batch_inserting

engine = create_engine("postgresql+psycopg2://postgres@localhost", executemany_mode="values")  # SQLAlchemy < 1.3.7 needs use_batch_mode=True instead
Session = sessionmaker(bind=engine)
session = Session()
enable_batch_inserting(session)
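With batching enabled, flushing many new rows looks the same as it always did. The Widget model below is illustrative, not part of this library; it only needs an autoincrementing primary key:

from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Widget(Base):  # illustrative model with an autoincrementing primary key
    __tablename__ = "widgets"
    id = Column(Integer, primary_key=True)
    name = Column(Text)

Base.metadata.create_all(engine)
session.add_all([Widget(name="widget %d" % i) for i in range(100)])
session.flush()  # with batching enabled: 2 roundtrips instead of 100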

If you use Flask-SQLAlchemy,

from flask_sqlalchemy import SignallingSession
from sqlalchemy_batch_inserts import enable_batch_inserting

# Make sure that you've specified executemany_mode or use_batch_mode when creating your engine! Otherwise
# this library will not have any effect.
enable_batch_inserting(SignallingSession)
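If Flask-SQLAlchemy creates the engine for you, one way to pass executemany_mode through is the SQLALCHEMY_ENGINE_OPTIONS config key (available in Flask-SQLAlchemy 2.4+); a sketch:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy, SignallingSession
from sqlalchemy_batch_inserts import enable_batch_inserting

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql+psycopg2://postgres@localhost"
# Forwarded as keyword arguments to create_engine (Flask-SQLAlchemy >= 2.4).
app.config["SQLALCHEMY_ENGINE_OPTIONS"] = {"executemany_mode": "values"}
db = SQLAlchemy(app)

enable_batch_inserting(SignallingSession)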

Demo

docker build -t sqla_batch .
docker run --rm -v $PWD:/src --name sqla_batch sqla_batch
# Wait for it to finish spinning up
# Switch to another shell
docker exec -it sqla_batch src/demo.py no 100
docker exec -it sqla_batch src/demo.py yes 100

To simulate 100 * 3 inserts with 20 ms latency, first change the connection string in demo.py from postgresql+psycopg2://postgres@localhost to postgresql+psycopg2://postgres@db. Then,

docker network create sqla_batch
docker run --rm --network sqla_batch --network-alias db --name db sqla_batch
# Switch to another shell
docker run -it -v /var/run/docker.sock:/var/run/docker.sock --network sqla_batch gaiaadm/pumba netem --duration 15m --tc-image gaiadocker/iproute2 delay --time 20 --jitter 0 db
# Switch to another shell
# This should take 100 * 3 * 20 ms = 6 seconds
docker run -it --rm -v $PWD:/src --network sqla_batch sqla_batch src/demo.py no 100
# This should take 4 * 20 ms = 0.08 seconds
docker run -it --rm -v $PWD:/src --network sqla_batch sqla_batch src/demo.py yes 100

Acknowledgements

This is all possible thanks to @dvarrazzo's psycopg2 execute_batch, as well as @zzzeek's SQLAlchemy support for the same and his helpful advice on the mailing list.

Maintainer notes

After bumping the version in setup.py and __version__ in __init__.py,

$ ./setup.py sdist
$ twine upload --repository-url https://test.pypi.org/legacy/ dist/*
# Check https://test.pypi.org/project/sqlalchemy-batch-inserts/
$ twine upload dist/*