Speed up batch preparation by at least ~70×, reduce time by at least 3 days #18
One of the long steps in the pipeline is batch preparation, where the `mainAgeing` function is run with `BatchProducer = TRUE`. It contains many levels of nested iteration, which takes a very long time on large files.

The `python_prototype` directory in this pull request contains a prototype of a much faster implementation of the chunking code in Python.

Here is a plot comparing the performance of the old and new chunking pipelines as a function of input size, in hours. Note that with the existing implementation the four largest input files did not even complete in 84 hours (3.5 days, marked with × on the plot), while under the new implementation each of them completes in under 1 hour:
![image](https://private-user-images.githubusercontent.com/113850522/262386450-c2ede871-fa45-47f0-859e-596c7f6e185c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzMzAxMjIsIm5iZiI6MTczOTMyOTgyMiwicGF0aCI6Ii8xMTM4NTA1MjIvMjYyMzg2NDUwLWMyZWRlODcxLWZhNDUtNDdmMC04NTllLTU5NmM3ZjZlMTg1Yy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjEyJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxMlQwMzEwMjJaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03MDBiMzMzOGEzMzVjZjZhZDg3ZWZkY2E2ZWQ5YWRjNWY1ZmQ2YTA1N2UxZjM3Njg5YTY5MGU3OGFmODJkNmY0JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.TEz9v8u_2EpKncHut573RtBZiI-Mj94iMfK_9R31uu4)
Full benchmark results are available in this spreadsheet. In terms of CPU hours, the new implementation is at least 68× more efficient. In terms of minimum wall time (the time required to process the largest input file), the speed-up is at least 91×. Because the four largest files did not complete in 84 hours, all of these figures are lower bounds, and the real speed-up is even greater.
The high performance is achieved purely by a more efficient implementation: the algorithm used for chunking is exactly the same as in the original pipeline.
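As a hypothetical illustration only (the actual prototype lives in `python_prototype`; the function names below are invented, not taken from this PR), replacing deeply nested per-element iteration with a single vectorized NumPy expression is the kind of change that can yield speed-ups of this magnitude while producing identical output:

```python
import numpy as np

def chunk_ids_naive(n_rows, chunk_size):
    """Assign each row to a chunk with an explicit loop (slow for large inputs)."""
    ids = []
    for i in range(n_rows):
        ids.append(i // chunk_size)
    return ids

def chunk_ids_vectorized(n_rows, chunk_size):
    """Same assignment computed in one vectorized NumPy expression (fast)."""
    return np.arange(n_rows) // chunk_size

# Both approaches produce identical chunk assignments.
assert chunk_ids_naive(10, 4) == chunk_ids_vectorized(10, 4).tolist()
```

The key property is that the vectorized version computes the same chunk assignment as the loop, so correctness of the original algorithm is preserved; only the per-element interpreter overhead is removed.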