A quick addition of multiprocessing.Pool for higher throughput #2

Open
cerebis wants to merge 2 commits into LaboratorioBioinformatica:master from cerebis:faster_prokka

cerebis commented Aug 26, 2021

This change is neither thoroughly tested nor implemented in a way that I would consider complete. I am providing it just as an example, in case a maintainer wants to fettle this into vHULK.

The intent is to remove the significant bottleneck that is Prokka annotation. Since Prokka is pretty lightweight, the viral genomes are small, and the tasks are embarrassingly parallel, it would be far faster to invert the parallelism: run many single-threaded Prokka jobs concurrently, one per genome, rather than giving multiple threads to one genome at a time.

Therefore, I have just quickly hacked in a multiprocessing.Pool and call prokka with a single thread. I have also removed a bit of the print spam and set Prokka's verbose output to /dev/null. Because I seem to love staring at progress bars, I also added tqdm.
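
For readers who have not opened the diff, the pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`build_prokka_cmd`, `run_cmd`, `run_all`), the `genomes/` and `prokka_out/` paths, and the choice of 4 workers are all hypothetical, and the exact Prokka options vHULK passes may differ. The sketch falls back to a plain iterator if tqdm is not installed.

```python
import subprocess
from multiprocessing import Pool
from pathlib import Path

try:
    from tqdm import tqdm  # optional progress bar
except ImportError:
    def tqdm(iterable, **kwargs):
        # Fallback: no progress bar, just pass the iterable through.
        return iterable


def build_prokka_cmd(fasta, out_root):
    """Build a single-threaded Prokka command for one genome.

    The output layout (one directory per genome under out_root) is an
    assumption made for this sketch.
    """
    fasta = Path(fasta)
    prefix = fasta.stem
    return [
        "prokka",
        "--cpus", "1",  # one thread per job; the Pool supplies the parallelism
        "--outdir", str(Path(out_root) / prefix),
        "--prefix", prefix,
        "--force",
        str(fasta),
    ]


def run_cmd(cmd):
    """Run one command with its output discarded; return (cmd, exit code)."""
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # Executable missing or not runnable; 127 mirrors the shell convention.
        return cmd, 127
    return cmd, proc.returncode


def run_all(cmds, workers):
    """Run the commands concurrently and return those that exited non-zero."""
    failed = []
    with Pool(processes=workers) as pool:
        results = pool.imap_unordered(run_cmd, cmds)
        for cmd, code in tqdm(results, total=len(cmds)):
            if code != 0:
                failed.append(cmd)
    return failed


if __name__ == "__main__":
    genomes = sorted(Path("genomes").glob("*.fasta"))  # hypothetical input dir
    cmds = [build_prokka_cmd(g, "prokka_out") for g in genomes]
    failed = run_all(cmds, workers=4)
    print(f"{len(failed)} of {len(cmds)} annotations failed")
```

`imap_unordered` is used so the progress bar advances as soon as any job finishes, rather than waiting on the slowest genome in submission order. Collecting non-zero exit codes matters here because the tool's output is sent to /dev/null, so a failed annotation would otherwise go unnoticed.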

Instead of what looked to be hours, it now processes my 6200+ viral genomes in under 7 minutes with 50 cpus.

Now that the job has moved on to the hmmscan step, I can see that it would likely enjoy a similar speed-up from the same strategy.


cerebis commented Aug 26, 2021

I've done something similar for hmmscan in commit c189d61.

On a small test set of 10 genomes, the run completed successfully.
