[User Story] Improve runtime #184

Open
11 of 13 tasks
fellen31 opened this issue Jun 12, 2024 · 1 comment
Assignees
Labels
Effort L Effort Large Gain L Gain Large needs refinement This issue needs refinement Urgency M Urgency Medium user story a user story describing new functionality

Comments

Collaborator

fellen31 commented Jun 12, 2024

Need

As a pipeline developer, I want to be able to quickly iterate and run the pipeline on a big test dataset before a release. I'd like there to be comprehensive tests for the large test data as well, to find bugs before a release that are not caught with the small test data.

Since there are no AWS runners set up, I need to do this manually, which is time-consuming. Therefore I'd like it to be as fast as possible.

Suggested approach

Identify the most time-consuming tasks, and reduce the runtime of a few of them:

  • Hifiasm (~5h) - runtime can't really be improved, but by splitting the process in two we could make use of the cache in case of a failed run.
  • Dipcall (5-6h) - This module could be deprecated and replaced by simply aligning the assemblies back to the reference.
  • whatshap phase (~3h) - is single-threaded and takes several hours to run. It could be run per chromosome, or by default be replaced with LongPhase which runs in ~10 min.
    • whatshap haplotag (~40 min) - Could by default be replaced with LongPhase which runs in ~10 min.
  • DeepVariant - Could be further parallelized with smaller windows.
  • VEP (~45 min) - Could be run per chromosome/region, same as variant calling.
  • The uBAM files could be split directly with splitbam and the minimap2 module patched to pipe the output of samtools fastq into minimap2. This would also reduce storage footprint. However, uBAM to fastq conversion would need to happen in parallel for hifiasm.
    • Current: samtools fastq (25 min) -> fastp (8 min) -> minimap2 (~2 min) -> samtools merge (~5 min), total ~45 min.
    • Alternative: splitbam (~2 min) -> minimap2 (~2 min) -> samtools merge (~5 min), total ~9 min.
    • Known sex alternative: splitbam (~2 min) -> minimap2 (~2 min) -> samtools merge (~1 min) -> deepvariant, total ~5 min.
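
The per-chromosome idea for whatshap phase can be sketched as below. This is a minimal illustration, not the pipeline's implementation: it only builds one command line per chromosome so that they could be scattered across parallel jobs. The helper name `whatshap_phase_commands`, the file names and the chromosome list are all hypothetical. Merging the per-chromosome outputs back into one VCF is deliberately left out, since it depends on how the pipeline handles contigs that were not requested in a given job.

```python
import shlex


def whatshap_phase_commands(chromosomes, vcf, bam, reference, out_prefix):
    """Build one `whatshap phase` argument list per chromosome.

    Hypothetical helper: nothing is executed here, the commands are only
    assembled so that a scheduler could run them in parallel.
    """
    commands = []
    for chrom in chromosomes:
        commands.append([
            "whatshap", "phase",
            "--chromosome", chrom,              # restrict phasing to one chromosome
            "--reference", reference,
            "-o", f"{out_prefix}.{chrom}.vcf",  # assumed output naming scheme
            vcf,
            bam,
        ])
    return commands


if __name__ == "__main__":
    # Assumed chromosome naming (chr1..chr22, chrX, chrY); adjust to the reference.
    chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    for cmd in whatshap_phase_commands(
        chroms, "sample.vcf.gz", "sample.bam", "ref.fa", "sample.phased"
    ):
        print(shlex.join(cmd))
```

Since whatshap phase is single-threaded, the wall-clock time with this layout would be bounded by the slowest chromosome rather than by the sum over all of them, provided enough jobs can run at once.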

Can be closed when

Assembly

Phasing

Mapping and preprocessing

SNV calling

SNV annotation

@fellen31 fellen31 added needs refinement This issue needs refinement user story a user story describing new functionality labels Jun 12, 2024
@fellen31 fellen31 self-assigned this Jun 12, 2024
@fellen31 fellen31 added Gain L Gain Large Effort S Effort Small Effort L Effort Large Urgency L Urgency Large Urgency M Urgency Medium and removed Effort S Effort Small Urgency L Urgency Large labels Jun 16, 2024
@fellen31
Collaborator Author

In addition:
