bluenote-1577/dbghaplo

dbghaplo - long-read haplotypes from mixtures of "small" sequences

dbghaplo is a method that separates long reads (Nanopore or PacBio) from a mixture of sequences into groups with similar alleles. This is called "phasing" or "haplotyping".

dbghaplo is a "local haplotyping" method, so it works best when the sequence of interest is approximately the size of the reads. For genome-scale haplotyping, consider another tool such as floria.

Example use cases:

  • mixed viral long-read samples (e.g. co-infections)
  • amplicon/enriched sequencing of specific genes
  • haplotyping small sections of multi-strain bacterial communities

High-depth, heterogeneous sequencing that spans a 1kb gene.

Separated groups ("haplotypes") after running dbghaplo.

Why dbghaplo?

Other tools exist for detecting similar haplotypes in mixtures. dbghaplo was developed to fill the following gaps:

  • Speed and low memory - dbghaplo scales approximately linearly with sequencing depth and the number of SNPs. Genes with more than 30,000x coverage can be haplotyped in minutes.
  • High heterogeneity and coverage - dbghaplo uses a de Bruijn graph approach, which works with very diverse samples (> 10 haplotypes).
  • Ease of use and interpretable outputs - conda installable, written in Rust, with a simple command line. Outputs are easy to interpret (haplotagged BAM or MSA).

Install

mamba install -c bioconda dbghaplo
dbghaplo -h 

See the installation instructions on the wiki if you want to compile from source or use a static binary. One of these is necessary if you are not on an x86 architecture.

Quick Start after install

Option 1 (more flexible): Running dbghaplo with VCF + BAM

git clone https://github.com/bluenote-1577/dbghaplo
cd dbghaplo
dbghaplo -b hiv_test/3000_95_3.bam  -v hiv_test/3000_95_3.vcf.gz  -r hiv_test/OR483991.1.fasta

# results folder
ls dbghaplo_output
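
To run Option 1 on your own data, you need a sorted, indexed BAM and a compressed, indexed VCF. The sketch below shows one way these might be produced with minimap2, samtools, lofreq, and tabix. It is not necessarily what run_dbghaplo_pipeline does internally; the `prep_inputs` and `run` helpers, the `map-ont` preset, and the file names are assumptions made for illustration.

```shell
# Hypothetical helpers: build BAM + VCF inputs for `dbghaplo -b ... -v ... -r ...`.
# Set DRY_RUN=1 to print the commands without running them.
# Paths are wrapped in single quotes, so they must not contain single quotes.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    sh -c "$*"
  fi
}

prep_inputs() {
  reads=$1; ref=$2; out=$3
  [ "${DRY_RUN:-0}" = "1" ] || mkdir -p "$out"
  # Align long reads and sort. map-ont assumes Nanopore reads;
  # PacBio data would need a different minimap2 preset.
  run "minimap2 -ax map-ont '$ref' '$reads' | samtools sort -o '$out/reads.bam' -" || return 1
  run "samtools index '$out/reads.bam'" || return 1
  # Call variants against the reference, then compress and index the VCF.
  run "lofreq call -f '$ref' -o '$out/calls.vcf' '$out/reads.bam'" || return 1
  run "bgzip -f '$out/calls.vcf'" || return 1
  run "tabix -p vcf '$out/calls.vcf.gz'" || return 1
}
```

For example, `prep_inputs reads.fq.gz reference.fa inputs` could be followed by `dbghaplo -b inputs/reads.bam -v inputs/calls.vcf.gz -r reference.fa`.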

Option 2 (easier): Running dbghaplo with reads

git clone https://github.com/bluenote-1577/dbghaplo
cd dbghaplo
run_dbghaplo_pipeline -i hiv_test/3000_95_3.fastq.gz  -r hiv_test/OR483991.1.fasta --overwrite -o pipeline_output/ 

# results folder
ls pipeline_output

# intermediate files (bam + vcf files)
ls pipeline_output/pipeline_files
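
When several samples share one reference, the pipeline can be run once per read file. The loop below is a minimal sketch that uses only the `-i`, `-r`, and `-o` flags shown above; the `batch_cmds` helper, the `.fastq.gz` suffix, and the one-folder-per-sample output layout are assumptions. It prints the commands rather than running them, so they can be reviewed first.

```shell
# Hypothetical helper: print one run_dbghaplo_pipeline command per *.fastq.gz file.
# Paths are wrapped in single quotes, so they must not contain single quotes.
batch_cmds() {
  reads_dir=$1; ref=$2; out_root=$3
  for fq in "$reads_dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue            # skip if the glob matched nothing
    sample=$(basename "$fq" .fastq.gz)  # e.g. sampleA.fastq.gz -> sampleA
    echo "run_dbghaplo_pipeline -i '$fq' -r '$ref' -o '$out_root/$sample'"
  done
}
```

Once the printed commands look right, they can be executed with `batch_cmds reads reference.fa results | sh`.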

Note

If you did not install via conda, install the pipeline's dependencies and run the pipeline script from the cloned repository instead.

mamba install -c bioconda tabix samtools lofreq minimap2
git clone https://github.com/bluenote-1577/dbghaplo
./dbghaplo/scripts/run_dbghaplo_pipeline -i reads.fq.gz -r reference.fa -o pipeline_output
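
Before running the script this way, it can help to confirm that its dependencies are on your `PATH`. A minimal sketch: `missing_tools` is a hypothetical helper, and the tool list in the usage line is taken from the install command above.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running `missing_tools tabix samtools lofreq minimap2` prints nothing when all four tools are available.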

Manuals, tutorials, and cookbook

How to use dbghaplo

  • Output format - how to interpret dbghaplo's outputs.
  • Cookbook - usage examples.

Tutorials

  • Forthcoming.

Citation

Forthcoming.