bacpaq is a bioinformatics best-practice pipeline for bacterial genomic analysis of short-read (Illumina) and long-read (Oxford Nanopore) sequencing data. Currently, bacpaq supports WGS-based analyses; we plan to integrate microbiome (amplicon and shotgun metagenomics) analyses in the future.
bacpaq contains two high-level workflows, quality control and annotation, which are supported by several sub-workflows as described below.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
raw-reads-qc
This sub-workflow processes raw sequencing reads so that only high-quality reads remain, which helps ensure the reliability and accuracy of downstream analyses. The sub-workflow works differently depending on the sequencing platform.
For paired-end short reads (specifically Illumina), the raw reads are first randomly subsampled to a specified coverage using Rasusa. Subsampling requires the options --genome_size, giving the size of the reference genome, and --depth_cut_off, giving the cut-off value for read depth. The subsampled reads are then trimmed with the tool selected by the option --trim_tool. The available tools are fastp, Trimmomatic, and Trim Galore. QC reports for the reads before and after trimming are generated using FastQC and MultiQC. Additionally, the trimmed reads are screened for bacterial intra-species contamination using ConFindr. The subsampling and contamination detection steps can be skipped using the flags --skip_subsampling and --skip_confindr, respectively.
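As a rough illustration of the short-read options described above, the sketch below assembles a command with explicit subsampling and trimming settings. The values 5mb, 100 and fastp, the docker profile, and the samplesheet.csv and results paths are assumptions chosen for illustration; check the pipeline help message for the accepted spellings and units.

```shell
# Short-read QC options named in the text; all values are illustrative assumptions.
QC_OPTS="--genome_size 5mb --depth_cut_off 100 --trim_tool fastp"

# Print the command for review instead of running it.
# Remove the leading 'echo' to launch the pipeline.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results $QC_OPTS
```

Keeping the options in one variable makes it easy to reuse the same QC settings across several runs.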
For long-read sequencing data (specifically Oxford Nanopore), the sub-workflow starts with adapter trimming using Porechop_ABI. The reads are then randomly subsampled using Rasusa, as described above for paired-end short reads. QC reports for the processed reads are produced using NanoComp and pycoQC; generating these reports also requires the summary files produced by Guppy. Each of these steps can be skipped using the flags --skip_porechop, --skip_subsampling, --skip_quality_report, and --skip_pycoqc.
The entire raw read QC can be skipped with the flag --skip_raw_qc.
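The skip flags above can be combined freely. The sketch below shows one hypothetical long-read run that keeps adapter trimming but skips subsampling and pycoQC, for instance when no Guppy summary file is at hand. The profile and file paths are placeholders, not values taken from the pipeline documentation.

```shell
# Long-read QC toggles named in the text; which steps to skip is an illustrative choice.
LR_QC_OPTS="--skip_subsampling --skip_pycoqc"

# Printed for review; remove 'echo' to run. Paths and profile are placeholders.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results $LR_QC_OPTS
```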
Taxonomy-qc
The sub-workflow TAXONOMY_QC classifies the input reads (in .fastq format) into taxonomic units and performs de-hosting if necessary. By default, aligned and unaligned reads are saved. For taxonomic classification, users can choose between Kraken2 (default) and Centrifuge (--classifier="centrifuge"). This step also creates Krona plots for taxonomic visualization. De-hosting requires a reference genome to map against, and it can be performed with minimap2 (default) or BWA (--dehosting_aligner="bwa"). After alignment, samtools is used to separate aligned and unaligned reads.
The entire taxonomy QC can be skipped with the flag --skip_taxonomy_qc. Alternatively, only the de-hosting task can be skipped with the flag --skip_dehosting.
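To make the non-default choices concrete, the sketch below selects Centrifuge for classification and BWA for de-hosting, using the option values given above. The classifier database and host reference genome also have to be supplied, but the options for those are not named here, so they are left out; the profile and paths are placeholders.

```shell
# Non-default classifier and de-hosting aligner, spelled as in the text above.
TAX_OPTS="--classifier=centrifuge --dehosting_aligner=bwa"

# Printed for review; remove 'echo' to run. Paths and profile are placeholders.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results $TAX_OPTS
```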
Assembly
The Assembly subworkflow performs de novo genome assembly on quality-filtered/trimmed reads produced by the Raw_Reads_QC subworkflow.
For paired-end short reads (specifically Illumina), several assemblers are available, including spades, skesa, megahit, and velvet; the assembler is selected using the option --sr_assembler. For long-read sequencing data (specifically Oxford Nanopore), Dragonflye is currently the only supported assembler. Error-prone long-read assemblies can be further polished using medaka or racon, each of which can be run for multiple rounds by specifying --medaka_rounds or --racon_rounds. Set the number of polishing rounds to 0 to disable post-assembly polishing. A medaka model trained for a specific basecalling model and sequencing kit can be selected for polishing using --medaka_model.
Currently, the assembly subworkflow does not support single-end short reads or hybrid assembly.
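A minimal sketch of the assembly options follows: one option set for a short-read run and one for a long-read run. The choice of skesa and the round counts are illustrative assumptions, and --medaka_model is omitted because valid model names depend on the installed medaka version.

```shell
# Short reads: pick one of the assemblers listed above.
SR_OPTS="--sr_assembler skesa"

# Long reads: one round of medaka polishing; 0 disables racon polishing.
LR_OPTS="--medaka_rounds 1 --racon_rounds 0"

# Printed for review; remove 'echo' to run. Paths and profile are placeholders.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input short_reads.csv --outdir results_sr $SR_OPTS
echo nextflow run cidgoh/bacpaq -profile docker \
    --input long_reads.csv --outdir results_lr $LR_OPTS
```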
Assembly-qc
This sub-workflow performs quality control of genome assemblies, integrating three main tools: CheckM, QUAST, and BUSCO. These tools evaluate the quality, completeness, and contamination of genome assemblies. You may skip any of these steps by using the options --skip_checkm, --skip_quast and --skip_busco.
Gene-prediction
The Gene-prediction subworkflow predicts and annotates the coding sequences of bacterial genome assemblies using Prokka and/or Bakta. Bakta requires a database, whose path can be specified using the option --bakta_db; if no path is given, the database is downloaded into local storage. Optionally, a pre-trained gene model can be provided using the option --prodigal_training_file. GenBank or protein FASTA file(s) from which genes should be annotated can be provided using the option --annotation_protein_file. Both tools are executed by default; either can be disabled using the options --skip_prokka and/or --skip_bakta.
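For example, a run that annotates with Bakta only and reuses a previously downloaded database might look like the sketch below. The database path is a placeholder, as are the profile and file paths.

```shell
# Reuse a local Bakta database (placeholder path) and turn Prokka off.
ANNOT_OPTS="--bakta_db /path/to/bakta_db --skip_prokka"

# Printed for review; remove 'echo' to run.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results $ANNOT_OPTS
```

Pointing --bakta_db at an existing copy avoids downloading the database again on each run.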
AMR
The AMR subworkflow identifies AMR genes using RGI, AMRFinderPlus, ABRicate, abriTAMR, and ResFinder. Both ResFinder and AMRFinderPlus require a database, whose path can be specified using the options --resfinder_db and --amr_finderplus_db, respectively. If not specified, the AMRFinderPlus database will be downloaded locally. Additionally, virulence factors can be identified using ABRicate. The AMR or virulence factor database used with ABRicate can be specified using the option --abricate_db; only one database can be specified at a time. All AMR tools are executed by default, and individual tools can be disabled using the options --skip_abricate, --skip_amr_annotation, --skip_rgi, --skip_abritamr, or --skip_amrfinderplus.
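The sketch below combines the database options described above with one skip flag. The paths are placeholders. Passing vfdb to --abricate_db assumes that the option takes an ABRicate database name rather than a path, which should be confirmed against the pipeline help message.

```shell
# Local database locations (placeholder paths) for ResFinder and AMRFinderPlus.
AMR_DB_OPTS="--resfinder_db /path/to/resfinder_db --amr_finderplus_db /path/to/amrfinderplus_db"

# Assumption: --abricate_db accepts a database name such as vfdb (virulence factors).
ABRICATE_OPTS="--abricate_db vfdb"

# Printed for review; remove 'echo' to run. RGI is skipped purely as an example.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results \
    $AMR_DB_OPTS $ABRICATE_OPTS --skip_rgi
```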
Phage
The Phage subworkflow classifies input sequences into virus taxonomies using VirSorter2. When the subworkflow is run for the first time, it downloads the database files into local storage. If the database is already stored locally, it can be reused with the option --virsorter_db, which skips the download. The virus classification can be skipped with the option --skip_phage_annotation.
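A hedged example of reusing an existing VirSorter2 database: the directory below is a placeholder for wherever a previous run stored the database files.

```shell
# Placeholder location of a VirSorter2 database downloaded earlier.
VIRSORTER_DB="/path/to/virsorter2_db"

# Printed for review; remove 'echo' to run. Paths and profile are placeholders.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results \
    --virsorter_db "$VIRSORTER_DB"
```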
Plasmid
The Plasmid subworkflow detects and types plasmids present in bacterial genome assemblies using PlasmidFinder and the MOB-recon tool from MOB-suite, which can additionally reconstruct individual plasmid sequences. Both tools are executed by default; either can be disabled using the options --skip_mobsuite and/or --skip_plasmidfinder.
CRISPR
The CRISPR subworkflow identifies Cas operons, CRISPR arrays and spacer sequences in bacterial genome assemblies using CRISPRCasTyper. CRISPR identification is executed by default and can be disabled using the option --skip_crispr.
Pan-genome analysis
The Pangenome_analysis subworkflow carries out a pangenome analysis of gene-annotated sequences in GFF3 format using Roary and PIRATE. Both tools are executed by default; either can be disabled using the options --skip_roary and/or --skip_pirate.
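Because every annotation step has its own skip flag, a list of unwanted steps can be turned into options mechanically. The sketch below does this for three of the flags named in the sections above; the selection of steps is only an example.

```shell
# Build --skip_<step> options from a list of steps to leave out.
SKIP_OPTS=""
for step in phage_annotation crispr roary; do
    SKIP_OPTS="$SKIP_OPTS --skip_$step"
done

# Printed for review; remove 'echo' to run. Paths and profile are placeholders.
echo nextflow run cidgoh/bacpaq -profile docker \
    --input samplesheet.csv --outdir results $SKIP_OPTS
```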
- Install Nextflow (>=22.10.1).
- Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility (you can use Conda both to install Nextflow itself and also to manage software within pipelines. Please only use it within pipelines as a last resort; see docs).
- Download the pipeline and test it by printing the pipeline help message
nextflow run cidgoh/bacpaq -r [vers] --help
Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile, selected with the -profile option. You can chain multiple config profiles in a comma-separated string.
  - The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda, which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
  - Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
  - If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
  - If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
- Start running your own analysis!
nextflow run -r [vers] cidgoh/bacpaq \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
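As a sketch of the -params-file route, the snippet below writes a small YAML file and passes it to Nextflow. The keys mirror CLI options mentioned in this document with the leading dashes removed; the particular values and the file name params.yaml are illustrative assumptions.

```shell
# Write pipeline parameters to a YAML file; each key is a CLI option without '--'.
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
sr_assembler: spades
skip_busco: true
EOF

# Printed for review; remove 'echo' to run. The profile is a placeholder.
echo nextflow run cidgoh/bacpaq -profile docker -params-file params.yaml
```

A parameter file keeps long option lists under version control and makes a run easier to repeat.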
To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
cidgoh/bacpaq was originally written by the CIDGOH genomics group.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.