Speed up batch preparation by at least ~70×, reduce time by at least 3 days #18

Status: Open. Wants to merge 15 commits into `master`.
Conversation

marinak-ebi (Contributor)
One of the longest steps in the pipeline is batch preparation, where the mainAgeing function is run with BatchProducer = TRUE. It contains several levels of nested iteration, which takes a very long time on large input files.

The python_prototype directory in this pull request contains a prototype of a much faster implementation of the chunking code in Python.

Here is a plot comparing the performance of the old and new chunking pipelines as a function of input size, in hours. Note that under the existing implementation the four largest input files did not even manage to complete in 84 hours (3.5 days, marked with × on the plot), while under the new implementation each of them completes in under 1 hour:
[Plot: chunking time in hours vs. input file size, old vs. new pipeline]

Full benchmark results are available in this spreadsheet. In terms of CPU hours, the new implementation is at least 68× more efficient. In terms of minimum wall time (the time required to process the largest input file), the speed-up is at least 91×. Because the four largest files did not complete within 84 hours, all of these results are lower bounds, and the real speed-up will be even greater.

The high performance is achieved by:

  1. Avoiding costly nested iteration over large files: instead, an efficient two-pass sorting/chunking approach is used. This is very memory efficient and will be able to scale to much larger input sizes.
  2. Slightly changing the way output files are written. Instead of writing control lines into every chunk, they are written separately, only once per distinct group of chunks. This will require a very minor change to the rest of the pipeline (not yet implemented), but significantly saves both space and computation time.
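The two-pass idea above can be sketched as follows. This is a minimal illustration, not the actual `python_prototype` code: the function and variable names (`chunk_records`, `key`, `chunk_size`) are hypothetical, and real input would be streamed from files rather than held in a list.

```python
from collections import defaultdict

def chunk_records(records, key, chunk_size):
    """Hypothetical two-pass chunking sketch.

    Pass 1 groups records by their chunking key in a single linear scan
    (no nested iteration over the whole input). Pass 2 splits each group
    into fixed-size chunks; control lines would then be written once per
    group rather than repeated in every chunk.
    """
    # Pass 1: one linear scan to sort records into groups.
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    # Pass 2: split each group into chunks of at most chunk_size records.
    chunks = {}
    for group_key, recs in groups.items():
        chunks[group_key] = [
            recs[i:i + chunk_size] for i in range(0, len(recs), chunk_size)
        ]
    return chunks

records = [("g1", 1), ("g2", 2), ("g1", 3), ("g1", 4), ("g2", 5)]
chunks = chunk_records(records, key=lambda r: r[0], chunk_size=2)
# Group "g1" yields two chunks, group "g2" one chunk; the control lines
# for each group would be emitted once, alongside its list of chunks.
```

Because each pass is a single scan, the cost is linear in the input size, which is what allows the approach to scale to much larger files.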

The algorithm used for chunking is exactly the same as in the original pipeline.
