This pipeline is part of a tutorial. Please visit https://krakenflowtutorial.netlify.app/ to learn how it was built.
It is a reproducible Nextflow pipeline for fast, accurate taxonomic profiling of short‑read metagenomic datasets. It combines host/contaminant filtering with Bowtie2, taxonomic assignment with Kraken2, abundance re‑estimation with Bracken, interactive visualization with Krona, and an optional multi‑sample report built with phyloseq/R Markdown.
- Filter reads with Bowtie2: removes reads mapping to a reference (e.g., yeast) to reduce false positives downstream.
- Classify with Kraken2: generates per‑sample classification output and a `.k2report` summary.
- Re‑estimate abundances with Bracken: produces species‑level abundance tables (`.bracken`) and human‑readable reports (`.breport`).
- Visualize with Krona: creates an interactive `.krona.html` per sample.
- (Optional) Multi‑sample BIOM + report: merges Bracken outputs into `merged.biom` and renders `report.html` via an R Markdown template when a samplesheet is used.
- Nextflow: `>=22.10.0` recommended. Install: https://www.nextflow.io/
- Container engine: Docker is enabled by default in `nextflow.config`.
  - Alternative engines (Podman/Singularity/Apptainer) can work with small config tweaks.
- Databases/Indexes:
  - A valid Bowtie2 index for the host/contaminant you wish to filter (an example yeast index is included).
  - A Kraken2 database (e.g., Standard, PlusPF, or a custom DB). See: https://github.com/DerrickWood/kraken2/wiki/Manual#kraken-2-databases
- `main.nf` and `workflow.nf`: pipeline entry and orchestration
- `modules/`: individual process modules (`bowtie2`, `kraken2`, `bracken`, `kReport2Krona`, `ktImportText`, `kraken_biom`, `knit_phyloseq`)
- `data/yeast/`: example Bowtie2 yeast index
- `data/samples/`: example FASTQ layout
- `data/samplesheet.csv`: example multi‑sample sheet (columns: `sample_id`, `fastq_1`, `fastq_2`)
- `bin/report.Rmd`: R Markdown template used to build the optional multi‑sample report
You can provide input reads in two ways:
- FASTQ glob pattern via `--reads`
  - Example pattern: `"data/samples/*/*_{1,2}.fastq"`
  - The pipeline will pair `*_1.fastq`/`*_2.fastq` automatically per sample.
- Samplesheet via `--sheet_csv`
  - CSV with headers: `sample_id,fastq_1,fastq_2`
  - See the included `data/samplesheet.csv` for the expected format.
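
If your reads already follow the `data/samples/*/*_{1,2}.fastq` layout, a samplesheet can be generated rather than typed by hand. The sketch below is not part of the pipeline: `make_samplesheet` is a hypothetical helper, and it assumes that `sample_id` should be the file name prefix before `_1.fastq` and that the paths can be written exactly as the glob returns them. Compare its output against the included `data/samplesheet.csv` before relying on it.

```shell
# Hypothetical helper (not shipped with the pipeline): print a samplesheet for
# every <name>_1.fastq / <name>_2.fastq pair found one directory below $1.
# Assumption: sample_id is the file name prefix before "_1.fastq".
# Usage: make_samplesheet data/samples > my_samplesheet.csv
make_samplesheet() (
    samples_dir=$1
    echo "sample_id,fastq_1,fastq_2"
    for r1 in "$samples_dir"/*/*_1.fastq; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_1.fastq}_2.fastq"           # expected mate file
        if [ ! -e "$r2" ]; then
            echo "warning: no mate for $r1, skipping" >&2
            continue
        fi
        sample_id=$(basename "${r1%_1.fastq}")
        echo "$sample_id,$r1,$r2"
    done
)
```

Unpaired files are reported on stderr and left out, so the resulting CSV only lists complete pairs.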
- `--reads`: Glob for paired FASTQs (mutually exclusive with `--sheet_csv`).
- `--sheet_csv`: CSV with `sample_id,fastq_1,fastq_2` (enables BIOM + report).
- `--outdir`: Where results are written. The default is set in `nextflow.config`.
- `--bowtie2_index`: Prefix of the Bowtie2 index used for filtering.
- `--kraken2_db`: Path to your Kraken2 database directory.
- `--report`: Path to the Rmd template for the multi‑sample report.
Defaults in nextflow.config point to training paths; you will typically override them on the command line.
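
As an alternative to repeating the overrides on every command line, Nextflow can read parameters from a YAML file passed with `-params-file`. The sketch below assumes the parameter names listed above map directly to YAML keys; `write_params_file` and the file name `params.yml` are illustrative choices, and the Kraken2 path is still a placeholder you must replace.

```shell
# Hypothetical helper: write the example parameters to a YAML file at path $1.
write_params_file() {
    cat > "$1" <<'EOF'
reads: "data/samples/*/*_{1,2}.fastq"
outdir: "output"
bowtie2_index: "data/yeast/yeast"
kraken2_db: "/path/to/kraken2_db"
EOF
}

# Typical use:
#   write_params_file params.yml
#   nextflow run main.nf -params-file params.yml
```

Keeping the file under version control alongside the results makes it easier to see which inputs a given run used.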
Make sure you have Nextflow and Docker installed, then from the KrakenFlow directory:
Basic run with the included example yeast index and sample layout (replace the Kraken2 DB path):
    nextflow run main.nf \
      --reads "data/samples/*/*_{1,2}.fastq" \
      --outdir "output" \
      --bowtie2_index "data/yeast/yeast" \
      --kraken2_db "/path/to/kraken2_db"

Run using a samplesheet (enables merged BIOM + R Markdown report):
    nextflow run main.nf \
      --sheet_csv "data/samplesheet.csv" \
      --outdir "output" \
      --bowtie2_index "data/yeast/yeast" \
      --kraken2_db "/path/to/kraken2_db" \
      --report "bin/report.Rmd"

Notes:
- The included yeast Bowtie2 index prefix is `data/yeast/yeast`.
- You must supply a real Kraken2 database path; this repository does not include one.
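
Because a wrong index prefix or database path only fails once the run is under way, a quick check beforehand can save time. The sketch below is a hypothetical pre-flight helper, not something the pipeline provides. It relies on standard file naming: Bowtie2 index files end in `.bt2` (or `.bt2l` for large indexes), a built Kraken2 database contains `hash.k2d`, `opts.k2d`, and `taxo.k2d`, and Bracken looks for `database*mers.kmer_distrib` files. That the Bracken files live in the same directory as the Kraken2 database is an assumption about how this pipeline calls Bracken.

```shell
# Hypothetical pre-flight check; returns non-zero if something looks missing.
# Usage: check_inputs <bowtie2_index_prefix> <kraken2_db_dir>
check_inputs() (
    index_prefix=$1
    kraken2_db=$2
    status=0

    # Bowtie2 index files are named <prefix>.1.bt2 etc. (.bt2l for large indexes).
    if [ ! -e "$index_prefix.1.bt2" ] && [ ! -e "$index_prefix.1.bt2l" ]; then
        echo "missing Bowtie2 index: $index_prefix.1.bt2(l)" >&2
        status=1
    fi

    # A built Kraken2 database contains these three files.
    for f in hash.k2d opts.k2d taxo.k2d; do
        if [ ! -e "$kraken2_db/$f" ]; then
            echo "missing Kraken2 file: $kraken2_db/$f" >&2
            status=1
        fi
    done

    # Assumption: Bracken's k-mer distribution files sit in the Kraken2 DB directory.
    if ! ls "$kraken2_db"/database*mers.kmer_distrib >/dev/null 2>&1; then
        echo "no database*mers.kmer_distrib file in $kraken2_db" >&2
        status=1
    fi

    exit "$status"
)
```

For example, `check_inputs data/yeast/yeast /path/to/kraken2_db && nextflow run main.nf ...` would only launch the pipeline when all checks pass.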
For each sample_id in --outdir:
- `${sample_id}.sam`: filtered alignments (from Bowtie2)
- `${sample_id}.k2report`: Kraken2 report
- `${sample_id}.kraken2`: Kraken2 classifications
- `${sample_id}.bracken`: Bracken abundance estimates (species level by default)
- `${sample_id}.breport`: Bracken human‑readable report
- `${sample_id}.b.krona.txt`: Krona input table
- `${sample_id}.krona.html`: interactive Krona chart
For multi‑sample runs (when --sheet_csv is used):
- `merged.biom` in `--outdir`
- `report.html` in `--outdir` (phyloseq‑based overview)
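
For a quick look at a single sample without opening the Krona chart, the `.bracken` table can be summarized on the command line. The sketch below assumes the standard Bracken output layout: a tab-separated file with a header row, the taxon name in column 1, `new_est_reads` in column 6, and `fraction_total_reads` in column 7. `top_species` is a hypothetical helper name.

```shell
# Hypothetical helper: print the n most abundant taxa from a Bracken table.
# Usage: top_species <sample.bracken> [n]    (n defaults to 10)
top_species() (
    bracken_file=$1
    n=${2:-10}
    tab=$(printf '\t')
    # Skip the header, sort by column 6 (new_est_reads) in descending numeric
    # order, keep the first n rows, and print name, reads, and fraction.
    tail -n +2 "$bracken_file" \
        | sort -t "$tab" -k6,6nr \
        | head -n "$n" \
        | cut -f 1,6,7
)
```

For example, `top_species output/sample1.bracken 5` would list the five taxa with the most estimated reads, where `sample1` stands in for one of your own sample IDs.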
All tools run inside pinned containers (see modules/*.nf). Docker is enabled in nextflow.config; Nextflow will automatically pull the required images on first use.
- If Docker is not available, enable an alternative container engine in `nextflow.config` (e.g., `podman`, `singularity`) and ensure the images are accessible.
- Ensure your Kraken2 DB (including its Bracken k‑mer distribution files) was built with a k‑mer length and read length appropriate for your data; adjust Bracken parameters if you change read length.
- For large datasets, set a custom `--outdir` on a high‑capacity filesystem.
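
One way to switch engines without editing the shipped `nextflow.config` is to put the overrides in a small extra config file and pass it with `-c`. The sketch below uses Singularity as the example; `write_singularity_config` and the file name are illustrative, and whether the pinned images run unmodified under Singularity on your system is an assumption you should verify.

```shell
# Hypothetical helper: write a config override that turns Docker off and
# Singularity on, to the path given as $1.
write_singularity_config() {
    cat > "$1" <<'EOF'
// Override the Docker default from nextflow.config.
docker.enabled = false
singularity.enabled = true
singularity.autoMounts = true
EOF
}

# Typical use:
#   write_singularity_config singularity.config
#   nextflow run main.nf -c singularity.config \
#     --reads "data/samples/*/*_{1,2}.fastq" \
#     --outdir "output" \
#     --bowtie2_index "data/yeast/yeast" \
#     --kraken2_db "/path/to/kraken2_db"
```

The same pattern should work for Podman or Apptainer by enabling the corresponding `podman` or `apptainer` scope instead.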
If you use this workflow, please cite the underlying tools:
- Wood DE, Lu J, Langmead B. Kraken 2. Genome Biology (2019).
- Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken. PeerJ Computer Science (2017).
- Langmead B, Salzberg SL. Bowtie 2. Nature Methods (2012).
- Ondov BD et al. Krona. BMC Bioinformatics (2011).
Please also cite the training material adapted for the R Markdown report, as referenced in `bin/report.Rmd`.

