[User Story] Improve runtime #184

fellen31 · 2024-06-12T14:46:49Z

Need

As a pipeline developer, I want to be able to quickly iterate and run the pipeline on a big test dataset before a release. I'd like there to be comprehensive tests for the large test data as well, to find bugs before a release that are not caught with the small test data.

Since there are no AWS runners set up, I need to do this manually, which is time-consuming. Therefore I'd like it to be as fast as possible.

Suggested approach

Identify the most time-consuming tasks, and reduce a few of them:

Hifiasm (~5h) - Runtime can't really be improved, but by splitting the process in two we could make use of the cache if a run fails.
Dipcall (5-6h) - This module could be deprecated and replaced by simply aligning the assemblies back to the reference.
whatshap phase (~3h) - Single-threaded and takes several hours to run. It could be run per chromosome, or be replaced by default with LongPhase, which runs in ~10 min.
- whatshap haplotag (~40min) - Could be replaced by default with LongPhase, which runs in ~10 min.
DeepVariant - Could be further parallelized with smaller windows.
VEP (~45 min) - Could be run per chromosome/region, same as variant calling.
The uBAM files could be split directly with splitbam, and the minimap2 module patched to pipe the output of samtools fastq into minimap2. This would also reduce the storage footprint. However, uBAM to fastq conversion would still need to happen in parallel for hifiasm.
- Current: samtools fastq (25 min) -> fastp (8 min) -> minimap2 (~2 min) -> samtools_merge (~5 min), total ~45 min.
- Alternative: splitbam (~2 min) -> minimap2 (~2 min) -> samtools merge (~5 min), total ~9 min.
- Known sex alternative: splitbam (~2 min) -> minimap2 (~2 min) -> samtools merge (~1 min) -> deepvariant, total ~5 min.
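The piped alignment step suggested above could look roughly like the following sketch. It only assembles the shell command for one already-split uBAM chunk and does not run anything. The function name, file names, thread count, and the `map-hifi` preset are assumptions for illustration; the actual module patch and the splitbam invocation are not shown.

```python
import shlex


def build_align_pipeline(ubam, reference, out_bam, threads=4):
    """Return a shell pipeline that streams one uBAM chunk through minimap2.

    Hypothetical helper: nothing here is taken from the pipeline itself.
    """
    # samtools fastq writes FASTQ to stdout when no output file is given,
    # so no intermediate fastq file is stored on disk.
    to_fastq = ["samtools", "fastq", "-@", str(threads), ubam]
    # "-" makes minimap2 read the query from stdin; -a emits SAM.
    # The map-hifi preset is an assumption, pick the one matching the data.
    align = ["minimap2", "-a", "-x", "map-hifi", "-t", str(threads), reference, "-"]
    # samtools sort reads SAM from stdin and writes a coordinate-sorted BAM.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return " | ".join(shlex.join(cmd) for cmd in (to_fastq, align, sort))


if __name__ == "__main__":
    print(build_align_pipeline("chunk_0.bam", "ref.fa", "chunk_0.sorted.bam", threads=8))
```

Because the reads are streamed, the separate fastq conversion for hifiasm mentioned above would still have to run as its own task.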

Can be closed when

Assembly

Phasing

Mapping and preprocessing

[User Story] Treat BAM as the primary input #219
- Make splitbam available on bioconda fellen31/splitubam#1
- Add splitbam to nf-core
- Add samtools import for fastq-files
- Patch or update minimap2/align to allow BAM input, or create a new module
- Add nf-test to verify output from samtools merge

SNV calling

Add bedtools/makewindows to subdivide bedfiles #263
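As a rough illustration of what subdividing a BED file means, here is a minimal pure-Python sketch of fixed-size windowing, similar in spirit to `bedtools makewindows -b <bed> -w <size>`. The function name and the example interval are made up; the pipeline would use the bedtools module rather than this code.

```python
def make_windows(intervals, window_size):
    """Split BED intervals (chrom, start, end) into windows of at most window_size bp.

    Coordinates are 0-based, half-open, as in BED. Hypothetical helper.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = []
    for chrom, start, end in intervals:
        pos = start
        while pos < end:
            # The last window of an interval is truncated at the interval end.
            windows.append((chrom, pos, min(pos + window_size, end)))
            pos += window_size
    return windows


if __name__ == "__main__":
    for window in make_windows([("chr1", 0, 250)], 100):
        print(*window, sep="\t")
```

Smaller windows give more, shorter variant-calling tasks, which is the trade-off behind parallelizing DeepVariant further.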

SNV annotation

fellen31 · 2025-02-18T13:48:20Z

In addition:

It should be investigated whether providing a bgzipped reference to VEP improves runtime significantly.
filter_vep is too slow and should be replaced #522

fellen31 added the "needs refinement" and "user story" labels Jun 12, 2024

fellen31 self-assigned this Jun 12, 2024

github-project-automation bot added this to Nallo Jun 12, 2024

fellen31 added the "Gain L", "Effort S", "Effort L", "Urgency L" and "Urgency M" labels and removed the "Effort S" and "Urgency L" labels Jun 16, 2024
