Speed up batch preparation by at least ~70×, reduce time by at least 3 days #18

Status: Open. Wants to merge 15 commits into `master`.
Conversation

marinak-ebi (Contributor)
One of the longest steps in the pipeline is batch preparation, where the mainAgeing function is run with BatchProducer = TRUE. It contains several levels of nested iteration, which takes a very long time on large input files.

The python_prototype directory in this pull request contains a prototype of a much faster implementation of the chunking code in Python.

Here is a plot comparing the performance of the old and new chunking pipelines as a function of input size, in hours. Note that under the existing implementation the four largest input files did not even manage to complete in 84 hours (3.5 days, marked with × on the plot), while under the new implementation each of them completes in under 1 hour:
[Plot: chunking time in hours vs. input file size, old vs. new pipeline]

Full benchmark results are available in this spreadsheet. In terms of CPU hours, the new implementation is at least 68× more efficient. In terms of minimum wall time (the time required to process the largest input file), the speed-up is at least 91×. Because the four largest files did not complete within 84 hours, all of these results are lower bounds, and the real speed-up will be even greater.

The high performance is achieved by:

  1. Avoiding costly nested iteration over large files: instead, an efficient two-pass sorting/chunking approach is used. This is very memory efficient and will be able to scale to much larger input sizes.
  2. Slightly changing the way output files are written. Instead of writing control lines into every chunk, they are written separately, only once per distinct group of chunks. This will require a very minor change to the rest of the pipeline (not yet implemented), but significantly saves both space and computation time.
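The two-pass idea above can be sketched as follows. This is a minimal illustration, not the actual `python_prototype` code: the function and variable names (`chunk_records`, `key`, `chunk_size`) are hypothetical, and real input would be streamed from files rather than held in a list.

```python
from collections import defaultdict

def chunk_records(records, key, chunk_size):
    """Hypothetical two-pass chunking sketch.

    Pass 1 groups records by their chunking key in a single linear scan
    (no nested iteration over the whole input). Pass 2 splits each group
    into fixed-size chunks; control lines would then be written once per
    group rather than repeated in every chunk.
    """
    # Pass 1: one linear scan to sort records into groups.
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    # Pass 2: split each group into chunks of at most chunk_size records.
    chunks = {}
    for group_key, recs in groups.items():
        chunks[group_key] = [
            recs[i:i + chunk_size] for i in range(0, len(recs), chunk_size)
        ]
    return chunks

records = [("g1", 1), ("g2", 2), ("g1", 3), ("g1", 4), ("g2", 5)]
chunks = chunk_records(records, key=lambda r: r[0], chunk_size=2)
# Group "g1" yields two chunks, group "g2" one chunk; the control lines
# for each group would be emitted once, alongside its list of chunks.
```

Because each pass is a single scan, the cost is linear in the input size, which is what allows the approach to scale to much larger files.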

The algorithm used for chunking is exactly the same as in the original pipeline.
