GitHub - maxibor/floria-strainer: An add-on to floria for clustering reads into strains

floria-strainer: strains out genomes from floria.

Introduction

Given the output of the strain haplotyping software floria¹ , floria-strainer computes the allele frequency at each variable position of each haploset identified by floria to cluster them into the different strains composing the mixture, using a Gaussian Mixture Model.

Install

pip install floria-strainer

Quick start

Running floria-strainer on the provided test data

$ floria-strainer -b tests/data/test_short.bam tests/data/floria_out_dir
INFO - Writing the straining informations to floria_strained.strained.csv.
INFO - 2 strains were found: 0, 1
INFO - Writing the BAM file in tag mode to floria_strained.bam.

Running floria and floria-strainer on your own data

1. Variant calling

The floria authors recommend freebayes² to call the variants from your alignment files. In this step, Freebayes is used in "naive" mode, just to count all the "alleles" present at each variable position, using the --pooled-continous option.

freebayes \
-f reference.fa \
--pooled-continuous \
alignment.bam > freevayes_output.vcf

2. Running floria

floria \
-b alignment.bam \
-v freevayes_output.vcf \
-r reference.fa

The floria output will be written by default to the floria_out_dir directory

3. Running floria-strainer

floria-strainer `tag` mode

floria-strainer \
--bam alignment.bam \
-m tag \
floria_out_dir

floria-stainer in the tag mode will produce a bam file (by default, floria_strained.bam) which contains the strain-phasing information in the ST tag, and the haploset information in the HP tag.

floria-strainer `split` mode

floria-strainer \
--bam alignment.bam \
-m split \
floria_out_dir

floria-stainer in the split mode will produce one bam file per detected strain (by default floria_strained.{strain_number}.bam). Reads not being part of clustered to a specific strain will be written to all bam files.

Visualizing the strain clustering method

In this example, short-reads are aligned to a reference genome (fig 1). At each variable position, we can observe a 2 different alleles, which in the case of this haploid organism, corresponds to a mixture of 2 different strains.
floria-strainer takes the output of floria, with reads having been assigned to a haploset (fig 2) based on the MEC criteria.

Expand to see a high level overview of floria

The haplotyping process floria is conceptually similar to the one of de novo assembly of reads into contigs, where instead of contigs, read haplosets are the results of the flow (ie. least costly path) optimization through the different reads represented as a graph. More details in the floria article ¹.

Based on the average allele frequency of each haploset, reads are clustered in the different strains. In this example (fig 3), there are two strains of minor and major allele frequency, which floria-strainer clustered in strain 0 and strain 1.

Reads that weren't assigned to any haploset by floria, or whose haploset do not cluster well enough are not assigned to any strain. They are considered to be shared by the different strains present in the alignment.

Fig 1: Reads aligned to the reference genome, visualized in IGV. The top track represents the reference genome, with variants indicated in the different colors.

Fig 2: Reads are grouped and colored by the HP HaPloset tag as annotated by Floria.

Fig 3: Reads are grouped and colored by the ST STrain tag as annotated by floria-strainer.

Working with Ancient DNA (aDNA)

In the case of very short DNA fragments typically observed in aDNA samples, the following parameter set have been found to work well. Tweak them for your own needs, and experiment as needed:

For freebayes variant-calling

freebayes \
-f reference.fa \
--min-base-quality 20 \
--min-mapping-quality 30 \
--min-coverage 6 \
--limit-coverage 30 \
--pooled-continuous \
--use-best-n-alleles 3 \
alignment.bam > freebayes_output.vcf

For floria

floria \
-b alignment.bam \
-r reference.fa \
-v freebayes_output.vcf \
--snp-density 0.00001 \
--snp-count-filter 10 \
--ploidy-sensitivity 3 \
--threads 20

For floria-strainer (in `tag` or `split` mode)

floria-strainer \
--bam alignment.bam \
--hapq-cut 0 \
floria_out_dir

Help

$ floria-strainer --help
                                                                                                                                                                                                                     
 Usage: floria-strainer [OPTIONS] FLORIA_OUTDIR                                                                                                                                                                      
                                                                                                                                                                                                                     
 Strain the haplotypes in the floria output directory.                                                                                                                                                               
 Author: Maxime Borry                                                                                                                                                                                                
                                                                                                                                                                                                                     
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│    --version                      Show the version and exit.                                                                                                                                                      │
│    --nb-strains  -n  INTEGER      Number of strains to keep. If 0, the number of strains will be determined by the mean floria average strain count with HAPQ > HAPQ treshold [default: 0]                        │
│    --hapq-cut    -h  INTEGER      Minimum HAPQ threshold [default: 15]                                                                                                                                            │
│    --sp-cut      -s  FLOAT        Minimum strain clustering probability threshold [default: 0.5]                                                                                                                  │
│ *  --bam         -b  PATH         Input BAM file [required]                                                                                                                                                       │
│    --mode        -m  [tag|split]  BAM output mode. Tag: add ST (strain) tags to the reads. Split: split the reads in different BAM files per strain. [default: tag]                                               │
│ *  --basename    -o  TEXT         Output file basaneme [default: floria_strained] [required]                                                                                                                      │
│    --help                         Show this message and exit.                                                                                                                                                     │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Tests

After cloning/pulling repo

$ pip install poetry pytest
$ poetry run pytest -vv

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.github/workflows		.github/workflows
assets/img		assets/img
floria_strainer		floria_strainer
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Install

Quick start

Running floria and floria-strainer on your own data

1. Variant calling

2. Running floria

3. Running floria-strainer

floria-strainer `tag` mode

floria-strainer `split` mode

Visualizing the strain clustering method

Working with Ancient DNA (aDNA)

For freebayes variant-calling

For floria

For floria-strainer (in `tag` or `split` mode)

Help

Tests

About

Releases 3

Packages

Languages

License

maxibor/floria-strainer

Folders and files

Latest commit

History

Repository files navigation

Introduction

Install

Quick start

Running floria and floria-strainer on your own data

1. Variant calling

2. Running floria

3. Running floria-strainer

floria-strainer tag mode

floria-strainer split mode

Visualizing the strain clustering method

Working with Ancient DNA (aDNA)

For freebayes variant-calling

For floria

For floria-strainer (in tag or split mode)

Help

Tests

Footnotes

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

floria-strainer `tag` mode

floria-strainer `split` mode

For floria-strainer (in `tag` or `split` mode)

Packages