Downstream analysis of PopPUNK results. Produces subclusters and visualisations of all strains.
Further documentation can be found in the PopPUNK docs.
The default pipeline consists of the following steps:
- Split files into their PopPUNK strains.
- Use pp-sketchlib to calculate core and accessory distances within each strain.
- Use the core distances with rapidnj to make a neighbour-joining tree.
- Use SKA2 to generate within-strain alignments in reference-free mode.
- Use IQ-TREE to generate an ML phylogeny using this alignment, and the NJ tree as a starting point.
- Use fastbaps to generate subclusters which are partitions of the phylogeny.
Run this default (subclusters) pipeline with:

snakemake
In addition to the above:
- Create an overall visualisation with both core and accessory distances, as in PopPUNK. The final tree is produced by refining the NJ tree, grafting the maximum-likelihood trees for subclusters onto their matching nodes.
- Use Microreact to display the results.
Run the visualisation pipeline with:

snakemake make_microreact
In addition to the above, for each strain:
- Use SKA2 map to generate within-strain alignments in reference-based mode.
- Use Gubbins to remove recombination.
- Use BactDating to make timed trees.
- Use TransPhylo to infer transmission events on these timed trees.
This requires a transmission_metadata.csv file containing sampling times; see the PopPIPE configuration section below for a description of its format.
Run the transmission pipeline with:

snakemake transmission
The supported method is to use conda, which is most easily accessed by first installing miniconda. PopPIPE simply depends upon snakemake and pandas:
conda install snakemake pandas
Other dependencies will be automatically installed by conda the first time
you run the pipeline. You can also install them yourself and omit the --use-conda
directive to snakemake:
conda env create -n poppipe --file=environment.yml
If the package cannot be found you will need to add the necessary channels:
conda config --add channels r
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
An alternative, if you are having trouble with the above, is to use the PopPIPE docker container. If you are comfortable running commands inside docker containers and mounting your external files, the whole pipeline is available in the container, which you can pull by running:
docker pull poppunk/poppipe:latest
You can also follow the process above (a local snakemake install and a clone of this repository) and replace --use-conda with --use-singularity, which will automatically pull this container and run each step inside it. Use --singularity-args if you need to bind directories.
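For example, to bind an external data directory (the path is a placeholder for your own):

snakemake --use-singularity --singularity-args "-B /path/to/data"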
To run the pipeline:

- Modify config.yml as appropriate.
- Run snakemake --cores <n_cores> --use-conda.
In particular, check the three poppunk_ arguments, which should be set to the full paths of the --r-files argument, the strain clusters .csv file and the .h5 database file from your PopPUNK run.
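As a sketch, these entries in config.yml might look like this (all paths are placeholders for your own PopPUNK output):

poppunk_rfile: /path/to/poppunk/rfile.txt
poppunk_clusters: /path/to/poppunk/poppunk_db/poppunk_db_clusters.csv
poppunk_h5: /path/to/poppunk/poppunk_db/poppunk_db.h5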
On a cluster or the cloud, you can use snakemake's built-in --cluster
argument:
snakemake --cluster qsub -j 16 --use-conda
See the snakemake docs for more information on your cluster/cloud provider.
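For example, on a SLURM cluster you might wrap the submission command instead (the resource flag here is illustrative):

snakemake --cluster "sbatch --mem=4G" -j 16 --use-conda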
The default target is the first in the Snakefile: cluster_summary. This produces subclusters but no visualisations. To continue the run forward from here, use the make_microreact target:
snakemake --use-conda make_microreact
This will create a phylogeny and an embedding, format your strains and their subclusters for Microreact, and save these files to the output directory. The phylogeny and clusters will be sent to Microreact, and a link to your page will be printed to the terminal and saved in output/microreact_url.txt.
NB: From 2021-10-27, Microreact requires an API key for the final step to work. See the Microreact docs for instructions on how to generate one for your account.
The options in config.yml are as follows.

General options:
- poppipe_location: The scripts/ directory, if not running from the root of this repository.
- poppunk_rfile: The --r-files argument used with PopPUNK, which lists sample names and files, one per line, tab separated (see the example below).
- poppunk_clusters: The PopPUNK cluster CSV file, usually poppunk_db/poppunk_db_clusters.csv.
- poppunk_h5: The PopPUNK HDF5 database file.
- transmission_metadata: (optional) A CSV file with strain names in a column labelled 'Name' and sampling dates in a column labelled 'Date'. The date can be in the format YYYY-mm-dd, YYYY/mm/dd, or just the year YYYY (see the example below).
- min_cluster_size: The minimum size of a cluster to run the analysis on (recommended at least 6).
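For example, a poppunk_rfile might look like this (sample names and paths are illustrative; the columns are separated by a tab):

sample1	/data/sample1.fa
sample2	/data/sample2.fa

A matching minimal transmission_metadata.csv, using the same hypothetical sample names, might be:

Name,Date
sample1,2016-05-12
sample2,2017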
SKA options:
- fastq_qual: With read input, the -q option, which ignores k-mers with bases below this score.
- fastq_cov: With read input, the -c option, which sets a minimum k-mer count.
- kmer: The k-mer size; choose longer k-mers for less diverse clusters.
- single_strand: Set to true if sequences are single-stranded and phased (e.g. RNA viruses).
- freq_filter: The minimum frequency of samples a split k-mer must appear in to be included in the alignment.
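As a sketch, these might appear in config.yml as follows (the ska section name and all values are illustrative assumptions, not recommendations):

ska:
  fastq_qual: 20    # -q: ignore k-mers with bases below this score
  fastq_cov: 4      # -c: minimum k-mer count for read input
  kmer: 31          # split k-mer size
  single_strand: false
  freq_filter: 0.9  # minimum fraction of samples containing a split k-mer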
IQ-TREE options:
- enabled: Set to false to turn off ML tree generation and use the NJ tree throughout.
- mode: Set to full to run with the specified model; set to fast to run using --fast (like fasttree).
- model: A string for the -m parameter describing the model. Adding +ASC is not recommended.
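A corresponding sketch (the iqtree section name is an assumption; GTR+G is one example of a standard IQ-TREE -m string):

iqtree:
  enabled: true
  mode: full      # or 'fast' to run with --fast
  model: GTR+G    # passed to -m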
fastbaps options:
- levels: Number of levels of recursive subclustering.
- script: Location of the run_fastbaps script. Find this by running system.file("run_fastbaps", package = "fastbaps") in R.
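For example (the fastbaps section name and path are assumptions; substitute the path reported by the R command above):

fastbaps:
  levels: 2
  script: /path/to/R/library/fastbaps/run_fastbaps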
t-SNE embedding options:
- knn: Number of nearest neighbours (at least two).
- perplexity: Perplexity parameter for t-SNE (between 5 and 50).
- maxIter: Iterations in the optimisation (at least 10000; default 100000).
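A sketch with values inside the stated ranges (the embedding section name is an assumption):

embedding:
  knn: 50
  perplexity: 30
  maxIter: 100000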
Microreact options:
- name: Title of the Microreact to produce.
- website: Website link to give in Microreact.
- email: Contact email to list in Microreact.
- api_token: The API token from your Microreact account.
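For instance (the microreact section name is an assumption, and all values are placeholders for your own details):

microreact:
  name: Example PopPIPE run
  website: https://example.com
  email: user@example.com
  api_token: <your Microreact API token>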
Gubbins options:
- prefix: Folder name for Gubbins results.
- tree_builder: Program to use to build trees.
- min_snps: Minimum number of SNPs to identify a recombination block.
- min_window_size: Minimum window size of blocks.
- max_window_size: Maximum window size of blocks.
- iterations: Maximum iterations of tree building and removal.
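A sketch (the gubbins section name and all values are illustrative assumptions, not tested recommendations):

gubbins:
  prefix: gubbins
  tree_builder: raxml
  min_snps: 3
  min_window_size: 100
  max_window_size: 10000
  iterations: 5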
TransPhylo options:
- dateT: Date when transmission stops. If zero, the maximum date is used. If supplied, this needs to be on the same scale as BactDating, which is the number of years since 1970 (i.e. Unix time divided by 365).

Any other options in this section are passed to the inferTTree() function in TransPhylo; we refer users to its reference documentation.
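For illustration, a hypothetical transphylo entry in config.yml (the section name and value are assumptions; the arithmetic follows the scale described above):

transphylo:
  # mid-2020 on this scale: (2020 - 1970) + 0.5 = 50.5
  dateT: 50.5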
You can use the helper script poppipe_assign.py to re-run the pipeline after query assignment. For example, if you assigned to a database with:
poppunk_assign --db listeria_rlist --query qlist.txt --output listeria_qlist
Give the same arguments, plus your config file, to python poppipe_assign.py:
:
python poppipe_assign.py --db listeria_rlist --query qlist.txt --output listeria_qlist --config config.yml
This will generate combined input files and a new config file. Run snakemake again with the generated config file:
snakemake --configfile configv2rfmbr9.yml
Here, the first snakemake pipeline ran on four strains consisting of 83 samples:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 cluster_summary
4 fastbaps
4 generate_nj
4 iq_tree
4 ska_align
58 ska_index
4 sketchlib_dists
4 split_strains
83
The second one, with the new queries, had one more strain, and nineteen new samples to index:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 cluster_summary
5 fastbaps
5 generate_nj
5 iq_tree
5 ska_align
19 ska_index
5 sketchlib_dists
5 split_strains
50
NB: This will re-run all downstream steps for each strain, other than the ska index steps. If only a small number of strains have changed, this is likely to be inefficient. If you would like us to support this type of analysis, please get in touch.