UPHL-BioNGS/walkercreek is named after Walker Creek, which begins near Sunset Peak (elevation 10,088 ft) east of Meadow, Utah, and flows through Sunset Canyon. On the upper western-facing rocky slope of the canyon lies the resting place of Chief Walkara, also known as Chief Walker, a revered leader of the Utah Timpanogos and Sanpete Band of the Shoshone. Known for his penetrating gaze, he earned the nickname "Hawk of the Mountains." He was a renowned diplomat, horseman, warrior, and military leader, famed for his role in raiding parties and the Wakara War. As a prominent Native American chief in Utah at the time of the Mormon Pioneers' arrival in 1847, he was known for his trading acumen, engaging with both European settlers and his own people. Chief Walker died of "lung fever" on January 29, 1855, and was buried with significant rituals, reflecting the deep respect he commanded within his community.
UPHL-BioNGS/walkercreek is a bioinformatics best-practice analysis pipeline designed for the assembly, classification, and clade assignment of Illumina paired-end influenza data using the nf-core template. Currently, this pipeline accepts the influenza modules provided by IRMA with "FLU" designated as the default module. Future versions plan to support the analysis of other viral pathogens found in IRMA's modules, including RSV.
IRMA is used for the adaptive NGS assembly of influenza and other viruses. It was developed by Samuel S. Shepard in collaboration with the Bioinformatics Team at the CDC’s Influenza Division. To gain insights into the innovative algorithm powering IRMA, refer to the IRMA manuscript. Due to the rapid evolution and high variability of viral genomes, IRMA avoids traditional reference-based assembly. It introduces a flexible, on-the-fly approach to reference editing, correction, and optional elongation, eliminating the necessity for external reference selection. This adaptability helps to ensure highly accurate and reproducible results.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
Downloads SRA data in FASTQ format if a list of SRA IDs is provided as input.
- Prefetch sequencing reads in SRA format (`SRATools_PreFetch`).
- Convert the SRA format into one or more compressed FASTQ files (`SRATools_FasterQDump`).
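As a conceptual sketch of the two download steps above: each SRA run ID is first fetched as an `.sra` archive with `prefetch`, then converted to paired FASTQ files with `fasterq-dump` (both are real sra-tools commands; the helper function below is purely illustrative, not part of the pipeline).

```python
# Illustrative only: build the sra-tools command lines that correspond to the
# SRATools_PreFetch and SRATools_FasterQDump steps. The helper name is
# hypothetical; in the pipeline, Nextflow processes run these tools directly.

def sra_download_commands(run_ids):
    """Return (prefetch, fasterq-dump) command lines for each SRA run ID."""
    commands = []
    for run_id in run_ids:
        commands.append(f"prefetch {run_id}")                    # fetch .sra archive
        commands.append(f"fasterq-dump --split-files {run_id}")  # emit paired FASTQs
    return commands

for cmd in sra_download_commands(["SRR000001"]):
    print(cmd)
```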
Currently prepares influenza samples (paired-end FASTQ files) for assembly. These steps also provide different quality reports for sample evaluation.
- Combine FASTQ files from multiple lanes, when provided, into unified, consistently named FASTQ files (`Lane_Merge`).
- Remove human read data with the `NCBI_SRA_Human_Scrubber` so reads can be uploaded to public repositories for DNA sequencing data.
- Filter unpaired reads from FASTQ files (`SeqKit_Pair`).
- Trim reads and assess quality (`FaQCs`).
- Remove adapter sequences and the PhiX reference (`BBMap_BBDuk`).
- Generate a QC report by extracting data from the FaQCs report (`QC_Report`).
- Assess read data with `Kraken2_Kraken2` to identify the species represented.
- `FastQC` - filtered reads QC.
- `MultiQC` - aggregate report describing results and QC from the whole pipeline.
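To make the `SeqKit_Pair` step above concrete: pairing keeps only reads whose mate is present in the other file. The sketch below mimics that behavior on plain read-ID lists (the real step uses seqkit on FASTQ files; this helper is illustrative only).

```python
# Illustrative sketch of what read pairing (the SeqKit_Pair step) accomplishes:
# keep only reads whose mate exists in the other file. Plain ID lists stand in
# for FASTQ records here; the pipeline itself delegates this to seqkit.

def pair_reads(r1_ids, r2_ids):
    """Return the read IDs present in both R1 and R2, preserving R1 order."""
    common = set(r1_ids) & set(r2_ids)
    return [rid for rid in r1_ids if rid in common]

paired = pair_reads(["read1", "read2", "read3"], ["read2", "read3", "read4"])
print(paired)  # ['read2', 'read3']
```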
Cleaned read data undergo assembly and influenza typing and subtyping. Based on the subtype information, the appropriate Nextclade variables are gathered.
- Assemble influenza gene segments with `IRMA` using the built-in FLU module; influenza type and H/N subtype classifications are also made.
- Calculate the reference length, sequence length, and percent coverage for segments assembled by IRMA (`IRMA_SEGMENT_COVERAGE`).
- Calculate the number of mapped reads and mean depth for segments assembled by IRMA (`SAMTOOLS_MAPPED_READS`).
- QC of the consensus assembly (`IRMA_Consensus_QC`).
- Generate the IRMA consensus QC report (`IRMA_Consensus_QC_Reportsheet`).
- Annotate IRMA consensus sequences with `VADR`.
- Classify influenza A type and H/N subtype, as well as influenza B type and lineage, using `Abricate_Flu`. The database used in this task is InsaFlu.
- Generate a summary report for influenza classification results (`IRMA_Abricate_Reportsheet`).
- Gather the corresponding Nextclade dataset using the `Abricate_Flu` classification results (`Nextclade_Variables`).
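The per-segment coverage metric produced in the `IRMA_SEGMENT_COVERAGE` step above is simply the assembled sequence length expressed as a percentage of the reference segment length. A minimal sketch (function name and rounding are illustrative, not the pipeline's exact code):

```python
# Illustrative sketch of the percent-coverage metric reported per IRMA-assembled
# segment: assembled length as a percentage of the reference segment length.

def percent_coverage(sequence_length, reference_length):
    """Percent of the reference segment covered by the assembled sequence."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return round(100.0 * sequence_length / reference_length, 2)

print(percent_coverage(1700, 1778))  # 95.61
```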
Obtains datasets for Nextclade influenza genome analysis from the dataset determined by flu classification. Performs clade assignment, mutation calling, and sequence quality checks, followed by parsing the output report from Nextclade.
- Acquire the dataset necessary for influenza genome clade assignment (`Nextclade_DatasetGet`).
- Determine influenza genome clade assignment, perform mutation calling, and run sequence quality checks (`Nextclade_Run`). Additionally, for each sample processed through `Nextclade_Run`, a phylogenomic dataset named nextclade.auspice.json is generated, which can be visualized using the auspice.us platform.
- Parse the Nextclade output (`Nextclade_Parser`) and generate a report (`Nextclade_Report`).
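The parsing step above boils down to pulling a few columns out of Nextclade's tab-separated output. A minimal sketch, assuming Nextclade's usual TSV column names (`seqName`, `clade`, `qc.overallStatus`) — verify against the column set your Nextclade version emits:

```python
import csv
import io

# Illustrative sketch of the Nextclade_Parser step: extract the clade call and
# overall QC status per sample from Nextclade's TSV output. Column names follow
# Nextclade's TSV conventions but should be checked against your version.

def parse_nextclade_tsv(tsv_text):
    """Map each sequence name to its (clade, overall QC status) pair."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {r["seqName"]: (r["clade"], r["qc.overallStatus"]) for r in rows}

example = "seqName\tclade\tqc.overallStatus\nsampleA\t3C.2a1b.2a.2\tgood\n"
print(parse_nextclade_tsv(example))  # {'sampleA': ('3C.2a1b.2a.2', 'good')}
```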
Compiles report sheets from the modules above and outputs a pipeline summary report as a TSV file.
- `Summary_Report` consolidates and merges multiple report sheets into a single comprehensive summary report.
- `merged_bam_coverage_results` merges the gene segment report sheets detailing mapped reads, mean depth, reference length, sequence length, and percent coverage.
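Conceptually, the consolidation above joins several per-module report sheets on the sample identifier so each sample ends up as one row. A minimal sketch with dicts standing in for the TSV sheets (function and field names are illustrative, not the pipeline's actual code):

```python
# Illustrative sketch of the report consolidation performed by Summary_Report:
# join several per-module report sheets (dicts keyed by sample ID) into a
# single record per sample.

def merge_reports(sample_ids, *reports):
    """Merge per-module report rows into one combined row per sample."""
    merged = {}
    for sid in sample_ids:
        row = {}
        for report in reports:
            row.update(report.get(sid, {}))
        merged[sid] = row
    return merged

qc = {"s1": {"reads": "12000"}}
typing = {"s1": {"subtype": "H3N2"}}
print(merge_reports(["s1"], qc, typing))  # {'s1': {'reads': '12000', 'subtype': 'H3N2'}}
```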
- Install Nextflow (`>=22.10.1`).

- Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter, or Charliecloud for full pipeline reproducibility.

- Download the pipeline and test it on a minimal dataset with a single command:

  ```bash
  nextflow run main.nf -profile test,<docker/singularity> --outdir <OUTDIR>
  ```
  Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (`YOURPROFILE` in the example command above). You can chain multiple config profiles in a comma-separated string.

  - The pipeline comes with config profiles called `docker`, `singularity`, `podman`, `shifter`, and `charliecloud` which instruct the pipeline to use the named tool for software management. For example, `-profile test,docker`.
  - Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
  - If you are using `singularity`, please use the `nf-core download` command to download images first, before running the pipeline. Setting the `NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir` Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
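As an illustration of the `singularity.cacheDir` option mentioned above, a minimal custom config (the cache path here is a placeholder you would replace with your own) might look like:

```groovy
// Hypothetical minimal custom config: enable Singularity and point it at a
// shared image cache so downloaded containers are reused across runs.
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = '/path/to/singularity-cache'
}
```

Pass it to a run with `-c custom.config` alongside your chosen profiles.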
- Start running your own analysis!

  ```bash
  nextflow run main.nf -profile <docker/singularity> --input samplesheet.csv --outdir <OUTDIR>
  ```
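The samplesheet layout should be taken from the pipeline's usage documentation; assuming the common nf-core paired-end convention (`sample,fastq_1,fastq_2` columns, with illustrative file paths), a minimal samplesheet might look like:

```csv
sample,fastq_1,fastq_2
sample1,/data/sample1_R1_001.fastq.gz,/data/sample1_R2_001.fastq.gz
sample2,/data/sample2_R1_001.fastq.gz,/data/sample2_R2_001.fastq.gz
```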
- After a successful run, it is advisable to delete large temporary and log files; they take up substantial space and may cause issues in future runs:

  ```bash
  rm -rf work/ .nextflow* .nextflow.log*
  ```
The UPHL-BioNGS/walkercreek pipeline comes with documentation about the pipeline usage and output.
UPHL-BioNGS/walkercreek was originally written by Tom Iverson @tives82.
If you would like to contribute to this pipeline, please see the contributing guidelines. Contributions are welcome!
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.