A quick addition of multiprocessing.Pool for higher throughput #2
Open
cerebis wants to merge 2 commits into LaboratorioBioinformatica:master from
Author: I've done something similar for hmmscan in commit c189d61. On a small test set of 10 genomes, the run completed successfully.
This change is neither thoroughly tested nor implemented in a way that I would consider complete. I am providing it just as an example, in case a maintainer wants to fettle this into vHULK.
The intent is to remove the significant bottleneck that is Prokka annotation. Since Prokka is fairly lightweight, the viral genomes are small, and the tasks are embarrassingly parallel, it is far faster to invert the parallelism: run many single-threaded Prokka jobs at once rather than one multi-threaded job at a time.
Therefore, I have quickly hacked in a multiprocessing.Pool and call Prokka with a single thread. I have also removed a bit of the print spam and redirected Prokka's verbose output to /dev/null. Because I seem to love staring at progress bars, I also added tqdm. Instead of what looked to be hours, it now processes my 6200+ viral genomes in under 7 minutes with 50 CPUs.
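For anyone who wants the idea without reading the diff, here is a minimal sketch of the pattern described above. It is not the code in this PR. The function names (`prokka_command`, `run_quiet`, `run_all`), the `genomes` input directory, and the `prokka_out` output directory are hypothetical, and the Prokka flags shown are an illustrative subset rather than vHULK's actual invocation. tqdm is treated as optional so the sketch still runs without it.

```python
import subprocess
import sys
from multiprocessing import Pool
from pathlib import Path

try:
    from tqdm import tqdm  # progress bar over finished jobs
except ImportError:
    def tqdm(iterable, **kwargs):  # no-op stand-in when tqdm is not installed
        return iterable


def prokka_command(fasta, out_root):
    """Build a single-threaded Prokka command for one genome (illustrative flags)."""
    fasta = Path(fasta)
    return [
        "prokka", "--kingdom", "Viruses", "--cpus", "1", "--force",
        "--outdir", str(Path(out_root) / fasta.stem),
        "--prefix", fasta.stem,
        str(fasta),
    ]


def run_quiet(cmd):
    """Run one command with its output sent to /dev/null; return (cmd, exit code)."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return cmd, result.returncode


def run_all(commands, workers):
    """Run independent commands concurrently and return the ones that failed."""
    failed = []
    with Pool(processes=workers) as pool:
        # imap_unordered yields each result as soon as its job finishes,
        # which keeps the progress bar moving.
        results = pool.imap_unordered(run_quiet, commands)
        for cmd, code in tqdm(results, total=len(commands)):
            if code != 0:
                failed.append(cmd)
    return failed


if __name__ == "__main__":
    genomes = sorted(Path("genomes").glob("*.fasta"))  # hypothetical input directory
    if genomes:
        commands = [prokka_command(g, "prokka_out") for g in genomes]
        failed = run_all(commands, workers=4)
        print(f"{len(failed)} of {len(commands)} Prokka runs failed")
        sys.exit(1 if failed else 0)
```

Because `run_all` only needs a list of argument lists, the same helper could in principle be pointed at any other per-genome command. Collecting the non-zero exit codes matters here, since sending output to /dev/null would otherwise hide failed runs.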
Now that the job has moved on to the hmmscan step, I can see that it would likely enjoy a similar speed-up from the same strategy.