The pipeline calls variants on Illumina paired-end (PE) / single-end (SE) reads provided in a directory and generates various combinations of core/non-core consensus fasta files that can be used for phylogenetic reconstruction or as input for Gubbins/BEAST analysis.
- Installation
- Input
- Steps
- Command line options
- Run pipeline on compute cluster
- Quick start
- Output files
- Customizing the config file
- Log
- Bonus Ducks
The dependencies are already installed in Snitkin lab bin_group folder:
/nfs/esnitkin/bin_group/variant_calling_bin/
Requires Python2:
/nfs/esnitkin/bin_group/anaconda2/bin/python
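For example, assuming the repository path used elsewhere in this README, you can verify the setup by printing the pipeline's help menu with the lab's Python2 interpreter:
/nfs/esnitkin/bin_group/anaconda2/bin/python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -h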
- readsdir: folder containing SE/PE reads. Apart from the standard MiSeq/HiSeq fastq naming convention (R1_001_final.fastq.gz), other acceptable fastq extensions are (an example directory listing follows this list):
- R1.fastq.gz/_R1.fastq.gz,
- 1_combine.fastq.gz,
- 1_sequence.fastq.gz,
- _forward.fastq.gz,
- _1.fastq.gz/.1.fastq.gz.
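For illustration, a valid reads directory for PE data might look like this (hypothetical sample names):
ls /Path-To-Your/test_readsdir/
sample01_R1.fastq.gz  sample01_R2.fastq.gz
sample02_R1.fastq.gz  sample02_R2.fastq.gz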
- config: a config file that sets pipeline configuration settings such as environment paths for various tools, paths to reference genomes, and filter parameters. The config file is an INI-style file that stores data as KEY: VALUE pairs grouped under [section] headers. These settings are applied globally to all variant call jobs. An example config file with default parameters is included in the code folder. You can customize this config file and provide it with the -config argument. An example parameter setting is shown below, where we set the bin directory path. This tells the pipeline that all the tools required for variant calling are located in the "binbase" directory of the bin_path section.
[bin_path]
binbase: /nfs/esnitkin/bin_group/variant_calling_bin/
- index: a reference genome index name as specified in the config file. For example, if you have set the reference genome path in the config file as shown below, then the required value for the command line argument -index would be -index KPNIH1
[KPNIH1]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1/
# Name of reference genome fasta file.
Ref_Name: KPNIH1.fasta
Here, Ref_Name is the reference genome fasta file located in Ref_Path. Similarly, if you want to use a different version of the KPNIH reference genome, you can create a new section with a different index name.
[KPNIH1_new]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1_new/
# Name of reference genome fasta file.
Ref_Name: KPNIH1_new.fasta
For more information, refer to Customizing the config file.
The pipeline is divided into two individual steps (the -steps option) that should be run in sequential order (see below for command line arguments).
1. Variant Calling: This step will run all standard variant calling steps on sample reads residing in the input reads directory.
The possible options are:
Option All: This will run all variant calling steps, from read trimming to variant calling.
Option clean,align,post-align,varcall,filter,stats: Use these options if you want to run individual steps of the variant calling process. This runs the steps starting from read cleaning/trimming, alignment to the reference genome, post-alignment SAM/BAM format conversion, variant calling/filtering, and finally generation of mapping/variant statistics.
You can run a part of the pipeline by customizing the order of the -steps argument. For example, to skip the Trimmomatic cleaning part and instead run the pipeline from alignment step, run it with the following options:
"align,post-align,varcall,filter,stats"
Note: The order of variant calling steps needs to be sequential. If skipping any steps, make sure the previous steps finished without errors and their results are already present in the output folder.
2. Generate various Core/Non-core SNP consensus, SNP/Indel Matrices and recombination filtering with Gubbins:
Option core_All: This will run all the core SNP consensus/matrix generating steps. Once the core variants and matrices are generated, the command will run Gubbins and RAxML/IQ-TREE jobs.
Note: Samples not meeting minimum Depth of Coverage threshold (10X) will be removed before running this step.
Some of the types of results generated by this step are:
- prefix_core_results directory under the output directory which will be the final results folder.
- intermediate data files required for generating core SNP matrix/consensus in vcf/fasta format.
- Various data matrices generated during this step that can be used for diagnosing variant filter criteria and their impact on the overall distribution of core variants.
- Gubbins recombination-filtered consensus fasta files, and RAxML/IQ-TREE trees generated from the recombination-filtered consensus.
usage: variant_call.py [-h] -type TYPE -readsdir DIR -outdir OUTPUT_FOLDER
-index INDEX [-steps STEPS] -analysis ANALYSIS_NAME
[-config CONFIG] [-suffix SUFFIX]
[-filenames FILENAMES] [-cluster CLUSTER] [-clean]
[-extract_unmapped EXTRACT_UNMAPPED] [-datadir DATADIR]
[-snpeff_db SNPEFF_DB] [-debug_mode DEBUG_MODE]
[-gubbins GUBBINS]
Variant Calling pipeline for Illumina PE/SE data.
optional arguments:
-h, --help show this help message and exit
Required arguments:
-type TYPE Type of reads: SE or PE
-readsdir DIR Path to Sequencing Reads Data directory. NOTE: Provide full/absolute path.
-outdir OUTPUT_FOLDER
Output Folder Path ending with output directory name to save the results. Creates a new output directory path if it doesn't exist. NOTE: Provide full/absolute path.
-index INDEX Reference Index Name. Most Frequently used reference genomes index options: KPNIH1 | MRSA_USA_300 | MRSA_USA_100 | CDIFF_630 | paris Make sure the paths are properly set in config file
-steps STEPS Variant Calling Steps in sequential order.
1. All: Run all variant calling steps starting from trimming the reads, mapping, post-processing the alignments and calling variants;
2. core_All: Extract core snps and generate different types of alignments, SNP/Indel Matrices and diagnostics plots to explore filtered snps.
-analysis ANALYSIS_NAME
Unique analysis name that will be used as prefix to saving results and log files.
Optional arguments:
-config CONFIG Path to Config file, Make sure to check config settings before running pipeline
-suffix SUFFIX Fastq reads suffix such as fastq, fastq.gz, fq.gz, fq; Default: fastq.gz
-filenames FILENAMES fastq filenames with one single-end filename per line.
If the type is set to PE, it will detect the second paired-end filename with the suffix from first filename.
Useful for running variant calling pipeline on selected files in a reads directory or extracting core snps for selected samples in input reads directory.
Otherwise the pipeline will consider all the samples available in reads directory.
-cluster CLUSTER Run variant calling pipeline in one of the four modes. Default: local. The suggested mode for the core SNP step is "cluster", which will run all the steps in parallel with the available cores. Make sure to provide a large-memory node for this option.
The possible modes are: local/parallel-local/parallel-cluster/cluster
Set your specific hpc cluster parameters in config file under the [scheduler] section. Supports only PBS scheduling system.
-clean clean up intermediate files. Default: OFF
-extract_unmapped EXTRACT_UNMAPPED
Extract unmapped reads, assemble it and detect AMR genes using ariba
-datadir DATADIR Path to snpEff data directory
-snpeff_db SNPEFF_DB Name of pre-build snpEff database to use for Annotation
-debug_mode DEBUG_MODE
yes/no for debug mode
-gubbins GUBBINS yes/no for running gubbins
Variant calling can be run in parallel on each sample using the -cluster argument. Set PBS resources such as the resource list (-l), email (-M), queue (-q), flux_account (-A) and notification type (-m) under the [scheduler] section of the config file.
Refer to the UMICH Flux website for a detailed explanation of each PBS specification.
Possible options for the -cluster option (Supported system: pbs):
cluster: The pipeline is optimized for this option. When this option is set, the pipeline will run the variant call jobs for each sample as individual small cluster jobs, or as a single large cluster job for the core SNP generation step.
local: This is the default option. When this option is set, the pipeline will analyze each sample one after another. This will not make use of the multiple cores present in your system or cluster. Use this option for small numbers of samples or for testing purposes.
parallel-local: This option will run the pipeline and analyze samples in parallel, but on a local system. This is preferred for an input of fewer than 20 samples on a multi-core local system.
If you want to generate core SNPs for more than a few hundred samples, run the analysis in parallel on the cluster (time- and memory-efficient). The default PBS resources used for parallel jobs are:
nodes=1:ppn=4,pmem=4000mb,walltime=24:00:00
See the resources option in the scheduler section of the config file. Detailed information is in the section Customizing the config file.
- Run variant calling step (All) on a set of PE reads with default parameters
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps All -cluster cluster
The above command will run the variant calling (step 1) pipeline on a set of PE reads residing in test_readsdir. The results will be saved in the output directory test_output_core. The config file contains options for some frequently used reference genomes. To know which reference genomes are included in the config file, look up the config file or check the help menu of the pipeline.
The results of variant calling will be placed in an individual folder generated for each sample in the output directory. A log file for each sample will be generated and can be found in each sample folder inside the output directory. A single log file of this step will be generated in the main output directory. For more information on log file prefix and convention, please refer to the log section below.
- Generate core SNPs/Matrices from variant calling results
This step compares multiple files simultaneously and involves many I/O operations, so it is recommended to provide higher-memory compute resources.
example:
nodes=1:ppn=12,mem=47000mb,walltime=250:00:00
Replace the resources option in the scheduler section of the config file with the above line before running the command.
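A minimal sketch of that edit, assuming you are working on a local copy of the config (config_edit, as described in Customizing the config file below):
sed -i 's/^resources: .*/resources: nodes=1:ppn=12,mem=47000mb,walltime=250:00:00/' /Path-to-local/config_edit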
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps core_All -cluster cluster -gubbins yes
All the final results will be saved in a date_time_core_results directory under the output folder, for example:
2018_10_23_17_25_08_core_results
├── core_snp_consensus
├── data_matrix
└── gubbins
The subdirectories of each of the above directories are arranged in the following fashion:
core_snp_consensus The core_vcf folder under this directory contains annotated core VCF files that were used for generating the core SNP consensus fasta results. Other folders contain different combinations of core/non-core consensus fasta files for individual samples. The consensus files from these folders are concatenated to generate multiple sequence alignment files, which are then placed in the gubbins folder and used as input for Gubbins jobs.
gubbins contains different combinations of core/non-core multi-fasta alignments. Alignments with "gubbins.fa" in their extension will be used as input for Gubbins. Once the Gubbins jobs finish, the pipeline runs RAxML and IQ-TREE on the Gubbins-generated recombination-filtered consensus fasta files. The RAxML and IQ-TREE results generated from these files will be placed in the raxml_results and iqtree_results folders respectively.
The two important MSA files used for Gubbins/RAxML/IQ-TREE are genome_aln_w_ref_allele_gubbins.fa and genome_aln_w_alt_allele_unmapped_gubbins.fa:
- genome_aln_w_ref_allele_gubbins.fa: a core genome alignment relative to the reference genome, generated from core variant positions (final core variants + reference alleles).
- genome_aln_w_alt_allele_unmapped_gubbins.fa: a genome alignment containing all the variants that were called in each sample. It includes all core variant positions; non-core variant positions are substituted with a dash (denoting unmapped) or an N (for positions not meeting hard variant filter thresholds or falling under functional class filters), while non-core variants that met all filter thresholds are substituted with the variant allele.
The different types of variants and the symbols used for each variant position are described below (a toy alignment excerpt follows the list):
- If a position is unmapped in a sample, then it will be denoted by '-'
- If a position did not meet a hard filter criterion in a sample, it will be denoted by 'N' in that particular sample.
- If a position falls under any of the functional class filter such as phage region, repeat region or custom mask region, then it will be denoted by 'N'
- If a position did not meet a hard filter criterion such as FQ or MQ in any one sample, that position will be masked entirely in all the samples.
- core_var_aln: alignment made up of only core variant positions.
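As a toy illustration (hypothetical sample names and sequence), a slice of genome_aln_w_alt_allele_unmapped_gubbins.fa might look like this, where '-' marks a position unmapped in that sample and 'N' a position removed by hard or functional class filters (note the N column is shared by both samples, consistent with masking across all samples):
>Sample_A
ATGCA-TNAT
>Sample_B
ATGCAGTNAT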
data_matrix This folder contains different types of data matrices and reports that can be queried for variant diagnostics/QC plots.
- matrices/SNP_matrix_allele.csv and Indel_matrix_allele.csv: contain allele information for each unique variant position (rows) called in each individual sample (columns). The row names are the positions where a variant was called in at least one of the given samples, together with their associated annotations.
- matrices/SNP_matrix_allele_new.csv: also contains allele information for each unique variant position (rows) called in each individual sample (columns), except that unmapped and filtered positions are substituted with a dash (-) and N respectively.
- matrices/SNP_matrix_code.csv and Indel_matrix_code.csv: contain one of the eight status codes (listed below) for each unique variant position (rows) called in each individual sample (columns).
- Functional_annotation_results/phage_region_positions.txt: contains positions identified by Phaster as phage regions.
- Functional_annotation_results/repeat_region_positions.txt: contains positions identified by nucmer as tandem repeats.
- Functional_annotation_results/mask_positions.txt: contains positions set by the user to mask from the final core position list.
- Functional_annotation_results/Functional_class_filter_positions.txt: an aggregated unique list of positions that fall under repeat, mask and phage regions (REPEAT_MASK_PHAGE).
- Row annotation: each row in the matrix represents a variant position and its associated annotation, divided into two parts separated by a semicolon. An example row annotation is shown below:
Type of SNP at POS > ALT functional=PHAGE_REPEAT_MASK locus_tag=locus_id strand=strand;ALT|Effect|Impact|GeneID|Nrchange|Aachange|Nrgenepos|AAgenepos|gene_symbol|product Sample_name
Coding SNP at 31356 > T functional=NULL_NULL_NULL locus_tag=USA300HOU_0023 strand=+;T|stop_gained|HIGH|USA300HOU_0023|c.373C>T|p.Gln125*|373/2361|125/786|null or hypothetical protein|possible 5'-nucleotidase; 0
The first part contains information such as the type of SNP, position, variant allele found, functional annotation, locus tag of the gene, and strand. The second part contains the snpEff annotation of the variant and its impact on the gene.
The different status codes are:
Code | Description |
---|---|
-1 | unmapped |
0 | reference allele |
1 | core |
2 | filtered |
3 | non-core or true variant but filtered out due to another sample |
-2 | Phage Region |
-3 | FQ Region Masked |
-4 | MQ Region Masked |
Some toy examples of how codes are arranged for different types of variants, and how they would be represented in the allele matrix, are shown below:
- Reference Allele:
- Unmapped Positions:
- Core Variants:
- Filtered Positions:
- Filtered Positions masked due to Phage/FQ/MQ filters:
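For instance (a hypothetical reading of the codes above), a row of SNP_matrix_code.csv reading 1,1,1 across three samples would denote a core variant passing filters in all of them; if the same position were instead unmapped in the third sample, that sample would be coded -1 and, the position no longer being core, the true variants in the remaining samples would be coded 3. A position inside a phage region would carry -2.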
Bargraph matrices
bargraph_counts.txt and bargraph_indel_counts.txt contain the distribution of all the variant positions called in each individual sample. In the bar graphs drawn from them, each color represents a type of variant filter and the bar height corresponds to the number of variants that were filtered out due to that particular hard filter parameter in each sample.
bargraph_indel_percentage.txt and bargraph_percentage.txt contain the same distributions as the count files, but with the counts transformed into percentages.
These matrices can be used for QC checks and to understand the effects of different variant filters on the total number of core variants.
Note: Samples with an unusual bar height should be considered outlier samples.
Run the generate_diagnostics_plots.R script created inside the data_matrix folder to generate various QC plots.
Requires: ggplot2 and heatmap.3
module load R
Rscript generate_diagnostics_plots.R
File | Description |
---|---|
barplot.pdf | Distribution of filter-pass variant positions (variants observed in all the samples) in each sample. Colors represent the filter criteria that caused them to get filtered out in that particular sample. |
barplot_DP.pdf | Distribution of filter-pass variant positions in each sample. Colors represent the read-depth range that they fall in. |
temp_Only_filtered_positions_for_closely_matrix_FQ.pdf | Heatmap spanning the reference genome that shows positions that were filtered out due to low FQ values |
DP_position_analysis.pdf | same information as in barplot_DP.pdf but shown in heatmap format |
temp_Only_filtered_positions_for_closely_matrix_DP.pdf | Heatmap spanning the reference genome that shows positions that were filtered out due to low DP values |
Example plots: barplot and barplot_DP.
- Run only part of a variant calling step
OPTIONAL: If you want to rerun the above analysis with different variant call/filter parameters (i.e., skip the read cleaning, alignment and post-alignment steps), you can do so as follows:
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps varcall,filter,stats -cluster parallel-cluster
- Run the variant calling steps on selected samples provided in filenames.txt. filenames.txt should contain fastq filenames, one single-end filename per line (an illustrative example follows the command below).
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps varcall,filter,stats -cluster parallel-cluster -filenames filenames.txt
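For illustration, filenames.txt might contain the following (hypothetical filenames; with -type PE, the matching reverse-read file is detected automatically from the suffix):
sample01_R1.fastq.gz
sample02_R1.fastq.gz
sample05_R1.fastq.gz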
- Run core_prep and core steps on selected files
If you decide to exclude some samples from the core SNP analysis step (step 2), there is a way to run the core_prep and core steps on selected files without running the entire pipeline.
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps core -cluster parallel-cluster -filenames filenames_custom
By default, the pipeline uses the config file that comes with it. Either edit that config file directly, or copy it to your local system, edit the copy, and provide its path with the -config argument.
cp variant_calling_pipeline/config /Path-to-local/config_edit
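You can then point the pipeline at the edited copy with the -config argument; for example, reusing the Quick start paths:
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps All -cluster cluster -config /Path-to-local/config_edit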
The pipeline implements customizable variant calling configurations using the config file. The config file can be customized to use your choice of aligner and variant caller by changing two parameters under the section [pipeline]. Currently, the pipeline supports the BWA aligner (mem algorithm) for aligning reads to the reference genome and SAMtools for variant calling.
# Set which tools to use in pipeline:
[pipeline]
# Options for Aligner:bwa / smalt / bowtie
aligner: bwa
# Options for variant_caller: gatkhaplotypecaller /samtools
variant_caller: samtools
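For example, to switch variant calling from SAMtools to GATK HaplotypeCaller (assuming the GATK paths are set correctly in their own section of the config file), change the variant_caller value:
[pipeline]
aligner: bwa
variant_caller: gatkhaplotypecaller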
Make sure you have downloaded all the dependencies for the pipeline in a folder path provided in the binbase option of the [bin_path] section.
# Set bin folder path. Please make sure all the executables are placed in the bin folder. Also make sure the path for individual tools are correct.
[bin_path]
binbase: /nfs/esnitkin/bin_group/variant_calling_bin/
NOTE: Add the required Perl libraries (such as in the case of vcftools) to the PERL5LIB environment variable. Flux users can do that by loading perl-modules:
module load perl-modules
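On systems without the perl-modules module, the same effect can be achieved by exporting the variable directly (hypothetical library path):
export PERL5LIB=/Path-to-local/perl5/lib:$PERL5LIB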
If you wish to run the jobs on a cluster, make sure you change the necessary scheduler parameters in the scheduler section shown below. For more information, visit the Flux homepage.
[scheduler]
resources: nodes=1:ppn=4,pmem=4000mb,walltime=24:00:00
email: username@umich.edu
queue: XXX
flux_account: XXX
notification: a
Every tool has its own *_bin option where you can set the name of the folder in which the tool resides. For example, in the Trimmomatic section below, the Trimmomatic tool resides in the /Trimmomatic/ folder (set with the trimmomatic_bin option), which in turn resides in the /nfs/esnitkin/bin_group/variant_calling_bin/ folder set by the binbase option above.
[Trimmomatic]
trimmomatic_bin: /Trimmomatic/
adaptor_filepath: adapters/TruSeq3-Nextera_PE_combined.fa
seed_mismatches: 2
palindrome_clipthreshold: 30
simple_clipthreshold: 10
minadapterlength: 8
keep_both_reads: true
window_size: 4
window_size_quality: 20
minlength: 40
headcrop_length: 0
colon: :
targetlength: 125
crop_length: 40
f_p: forward_paired.fq.gz
f_up: forward_unpaired.fq.gz
r_p: reverse_paired.fq.gz
r_up: reverse_unpaired.fq.gz
Parameters for each of the tools can be customized under the 'tool_parameter' attribute of each tool in the config file. For example, to change the minadapterlength parameter of Trimmomatic from 8 to 10, edit the value in the config file and restart the pipeline.
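A quick way to make such an edit on a local copy of the config (a sketch assuming the config_edit copy from above):
sed -i 's/^minadapterlength: 8/minadapterlength: 10/' /Path-to-local/config_edit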
The pipeline generates a log file following the naming convention yyyy_mm_dd_hrs_mins_secs_analysisname.log.txt and tracks each event/command. The log entries follow standard Python logging conventions:
INFO to print STDOUT messages;
DEBUG to print commands run by the pipeline;
ERROR to print STDERR messages; and
EXCEPTION to print exceptions that occurred while the pipeline was running.
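For example, to pull out just the commands that the pipeline ran, grep the DEBUG lines from the log (file name assembled from the convention above, using the example analysis name output_prefix):
grep "DEBUG" /Path/test_output_core/yyyy_mm_dd_hrs_mins_secs_output_prefix.log.txt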