A snakemake pipeline for variant calling from P. falciparum short amplicon reads

Motivation

PS: The pipeline is still in its infancy.

We sequenced an Illumina sequencing library on the Oxford Nanopore MinION (ONT) to evaluate the cost of this approach.

PCR amplicons from Plasmodium falciparum drug resistance markers (ama1, k13, dhps, dhfr and mdr1) were generated in duplicate.
Illumina sequencing libraries were generated using KAPA reagents and KAPA indexes.
Finally, ONT sequencing libraries were generated using a single set of ONT adapters and sequenced on the MinION using an R9.4.1 flow cell.
Because only one set of ONT adapters was used, the sequences cannot be demultiplexed into individual samples, so further analyses were done at the population level.

Below are the project dependencies:

Package management

conda - an open-source package and environment management system that runs on various platforms, including Windows, macOS and Linux.

Workflow management

snakemake - a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in python style.

Bioinformatics tools (packages)

fastqc - a quality control tool for high-throughput sequence data
multiqc - a tool for aggregating bioinformatics analysis reports across many samples and tools
porechop - a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
cutadapt - a tool that finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from high-throughput sequencing reads.
bwa - an aligner for short-read alignment (see minimap2 for long-read alignment)
bedtools - allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF
bcftools - a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
snpEff - a genetic variant annotation and effect prediction toolbox
SnpSift - a toolbox that allows you to filter and manipulate annotated files.

Where to start

Clone this project onto your computer using Git (which must be installed first) with the following command:
- git clone https://github.com/kevin-wamae/PlasmoSeq-DualTech.git
Navigate into the cloned directory using the following command:
- cd PlasmoSeq-DualTech

Directory structure

Below is the default directory structure:
- config/ - contains the workflow configuration files
- env/ - contains the Conda environment files
- input/ - contains fastq, adaptors and genome files
- output/ - contains the output from the analysis
- workflow/ - contains the Snakemake script (snakefile) and additional scripts

.
├── LICENSE
├── README.md
├── config
│ └── config.yaml
├── env
│ └── environment.yml
├── input
│ ├── 01_fastq
│ │ ├── file-1_0.fastq.gz
│ │ └── file-2_1.fastq.gz
│ ├── 02_adapters
│ │ ├── illumina-TruSeq-adapters.fasta
│ │ └── illumina-indexes.txt
│ └── 03_genome
│     ├── genome.fasta
│     └── genome_annotations.gff
├── output
└── workflow
    ├── scripts
    │ └── create_snpeff_db.sh
    └── snakefile
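
Before running anything, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the pipeline: the helper names `missing_paths` and `list_fastq_files` are hypothetical, and the list of required paths is simply copied from the tree above, so adjust it if your checkout differs.

```python
from pathlib import Path

# Paths taken from the directory tree above; adjust if your layout differs.
REQUIRED_PATHS = [
    "config/config.yaml",
    "env/environment.yml",
    "input/01_fastq",
    "input/02_adapters",
    "input/03_genome",
    "workflow/snakefile",
]


def missing_paths(project_root="."):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


def list_fastq_files(project_root="."):
    """Return the sorted names of *.fastq.gz files in input/01_fastq."""
    fastq_dir = Path(project_root) / "input" / "01_fastq"
    if not fastq_dir.is_dir():
        return []
    return sorted(f.name for f in fastq_dir.glob("*.fastq.gz"))


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    for path in missing_paths():
        print(f"missing: {path}")
    print("fastq files:", list_fastq_files())
```

Run from the root of the cloned directory, this prints any missing path and the fastq files it found, without modifying anything.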

Running the analysis

Install conda and execute the following commands:

1 - Create the conda analysis environment and install the dependencies from env/environment.yml by running the following command in your terminal:

conda env create --file env/environment.yml

2 - Activate the conda environment:

PS - This needs to be done every time you want to execute this pipeline:
conda activate ampseq-analysis

3 - Create the snpEff database by executing the bash script below. This script will download P. falciparum genome files from PlasmoDB and create a snpEff database:

PS - This analysis uses genome data release 51 from PlasmoDB, and the script only needs to be run once:
bash workflow/scripts/create_snpeff_db.sh

4 - Finally, execute the whole Snakemake pipeline by running the following command in your terminal:

PS - Replace 4 in the command with the number of CPUs you wish to use
snakemake -c4

5 - Alternatively, you can execute a specific rule by running the following command in your terminal:

PS - Replace rule in the command with the respective rule name from workflow/snakefile
snakemake -c4 rule (for example snakemake -c4 qc_raw_files)
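
If you launch the pipeline from a script, the two invocations above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_snakemake_command` is a hypothetical helper, not part of this repository, and the optional `-n` flag is Snakemake's dry-run switch, which reports the planned jobs without executing them.

```python
import shlex


def build_snakemake_command(cores, rule=None, dry_run=False):
    """Build the argument list for a Snakemake run.

    cores   -- number of CPUs, passed as -c<cores>
    rule    -- optional rule name from workflow/snakefile
    dry_run -- add -n so Snakemake only reports what it would run
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    command = ["snakemake", f"-c{cores}"]
    if dry_run:
        command.append("-n")
    if rule is not None:
        command.append(rule)
    return command


# Print the commands rather than running them, so nothing is executed by accident.
print(shlex.join(build_snakemake_command(4)))
# prints: snakemake -c4
print(shlex.join(build_snakemake_command(4, rule="qc_raw_files", dry_run=True)))
# prints: snakemake -c4 -n qc_raw_files
```

To actually run a command built this way, pass the returned list to `subprocess.run(command, check=True)` from inside the activated conda environment.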

Expected output

Below is the expected directory structure of the output/ directory:

01_snpeff_database/ - contains the snpEff database for variant annotation
02_qc_raw/ - contains the fastqc QC reports from the raw fastq files
03_multiqc_raw/ - contains the aggregated fastqc QC reports
04_trim_fastq_ont/ - contains fastq files after trimming ONT adaptors
05_trim_fastq_illumina/ - contains fastq files after trimming Illumina adaptors
06_qc_trimmed_files/ - contains the fastqc QC reports from the fastq files after quality trimming
07_read_mapping/ - contains genome mapping files (index, bam and bed)
08_variant_calling/ - contains variant calling files

output/
├── 01_snpeff_database
│   ├── P.falciparum
│   └── genomes
├── 02_qc_raw
├── 03_multiqc_raw
│   └── multiqc_data
├── 04_trim_fastq_ont
├── 05_trim_fastq_illumina
├── 06_qc_trimmed_files
├── 07_read_mapping
│   └── genomeIndex
└── 08_variant_calling
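
After a run, a quick way to see how far the pipeline got is to count the files in each stage directory. The sketch below is illustrative only: `summarise_outputs` and `first_incomplete_stage` are hypothetical helpers, and the stage names are copied from the list above on the assumption that they match what the workflow actually writes.

```python
from pathlib import Path

# Stage directory names copied from the expected-output list above.
STAGES = [
    "01_snpeff_database",
    "02_qc_raw",
    "03_multiqc_raw",
    "04_trim_fastq_ont",
    "05_trim_fastq_illumina",
    "06_qc_trimmed_files",
    "07_read_mapping",
    "08_variant_calling",
]


def summarise_outputs(output_dir="output"):
    """Map each stage directory to the number of files it contains (recursively).

    A stage whose directory has not been created yet maps to None.
    """
    summary = {}
    for stage in STAGES:
        stage_dir = Path(output_dir) / stage
        if stage_dir.is_dir():
            summary[stage] = sum(1 for f in stage_dir.rglob("*") if f.is_file())
        else:
            summary[stage] = None
    return summary


def first_incomplete_stage(output_dir="output"):
    """Return the first stage with no files, or None if every stage has output."""
    for stage, count in summarise_outputs(output_dir).items():
        if not count:  # covers both a missing directory (None) and an empty one (0)
            return stage
    return None
```

Calling `first_incomplete_stage()` from the project root returns the name of the earliest stage that produced no files, which is usually the place to start looking in the Snakemake log.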
