The pipeline calls variants on Illumina paired-end (PE) / single-end (SE) reads provided in a directory and generates various combinations of core/non-core consensus fasta files that can be used for phylogenetic reconstruction or as input for Gubbins/BEAST analysis.
- Installation
- Input
- Steps
- Command line options
- Run pipeline on compute cluster
- Quick start
- Output files
- Customizing the config file
- Log
- Bonus Ducks
The dependencies are already installed in Snitkin lab bin_group folder:
/nfs/esnitkin/bin_group/variant_calling_bin/
Requires Python2:
/nfs/esnitkin/bin_group/anaconda2/bin/python
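For example, assuming the repository path used elsewhere in this README, you can verify the setup by printing the pipeline's help menu with the lab's Python2 interpreter:
/nfs/esnitkin/bin_group/anaconda2/bin/python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -h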
- readsdir: folder containing SE/PE reads. Apart from the standard MiSeq/HiSeq fastq naming convention (R1_001_final.fastq.gz), other acceptable fastq extensions are (an example directory listing follows this list):
- R1.fastq.gz/_R1.fastq.gz,
- 1_combine.fastq.gz,
- 1_sequence.fastq.gz,
- _forward.fastq.gz,
- _1.fastq.gz/.1.fastq.gz.
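For illustration, a valid reads directory for PE data might look like this (hypothetical sample names):
ls /Path-To-Your/test_readsdir/
sample01_R1.fastq.gz  sample01_R2.fastq.gz
sample02_R1.fastq.gz  sample02_R2.fastq.gz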
- config: a config file that sets pipeline configuration settings such as environment paths for various tools, paths to reference genomes, and filter parameters. The config file is an INI-style file that stores data as KEY: VALUE pairs grouped under [section] headers. These settings are applied globally to all variant call jobs. An example config file with default parameters is included in the code folder. You can customize this config file and provide it with the -config argument. An example parameter setting is shown below, where we set the bin directory path. This tells the pipeline that all the tools required for variant calling are located in the "binbase" directory of the bin_path section.
[bin_path]
binbase: /nfs/esnitkin/bin_group/variant_calling_bin/
- index: a reference genome index name as specified in the config file. For example, if you have set the reference genome path in the config file as shown below, then the required value for the command line argument -index would be -index KPNIH1
[KPNIH1]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1/
# Name of reference genome fasta file.
Ref_Name: KPNIH1.fasta
Here, Ref_Name is the reference genome fasta file located in Ref_Path. Similarly, if you want to use a different version of the KPNIH reference genome, you can create a new section with a different index name.
[KPNIH1_new]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1_new/
# Name of reference genome fasta file.
Ref_Name: KPNIH1_new.fasta
For more information, refer to Customizing the config file.
The pipeline is divided into two individual steps (the -steps option) that should be run in sequential order (see below for command line arguments).
1. Variant Calling: This step will run all standard variant calling steps on sample reads residing in the input reads directory.
The possible options are:
Option All: This will run all variant calling steps, from read trimming to variant calling.
Option clean,align,post-align,varcall,filter,stats: Use these options if you want to run individual steps of the variant calling process. This runs the steps starting from read cleaning/trimming, alignment to the reference genome, post-alignment SAM/BAM format conversion, variant calling/filtering, and finally generation of mapping/variant statistics.
You can run a part of the pipeline by customizing the order of the -steps argument. For example, to skip the Trimmomatic cleaning part and instead run the pipeline from alignment step, run it with the following options:
"align,post-align,varcall,filter,stats"
Note: The order of variant calling steps needs to be sequential. If skipping any steps, make sure the previous steps finished without errors and their results are already present in the output folder.
2. Generate various Core/Non-core SNP consensus, SNP/Indel Matrices and recombination filtering with Gubbins:
Option core_All: This will run all the core SNP consensus/matrix generating steps. Once the core variants and matrices are generated, the command will run Gubbins and RAxML/IQ-TREE jobs.
Note: Samples not meeting minimum Depth of Coverage threshold (10X) will be removed before running this step.
Some of the types of results generated by this step are:
- prefix_core_results directory under the output directory which will be the final results folder.
- intermediate data files required for generating core SNP matrix/consensus in vcf/fasta format.
- Various data matrices generated during this step that can be used for diagnosing variant filter criteria and their impact on the overall distribution of core variants.
- Gubbins recombination-filtered consensus fasta files, and RAxML/IQ-TREE trees generated from the recombination-filtered consensus.
usage: variant_call.py [-h] -type TYPE -readsdir DIR -outdir OUTPUT_FOLDER
-index INDEX [-steps STEPS] -analysis ANALYSIS_NAME
[-config CONFIG] [-suffix SUFFIX]
[-filenames FILENAMES] [-cluster CLUSTER] [-clean]
[-extract_unmapped EXTRACT_UNMAPPED] [-datadir DATADIR]
[-snpeff_db SNPEFF_DB] [-debug_mode DEBUG_MODE]
[-gubbins GUBBINS]
Variant Calling pipeline for Illumina PE/SE data.
optional arguments:
-h, --help show this help message and exit
Required arguments:
-type TYPE Type of reads: SE or PE
-readsdir DIR Path to Sequencing Reads Data directory. NOTE: Provide full/absolute path.
-outdir OUTPUT_FOLDER
Output Folder Path ending with output directory name to save the results. Creates a new output directory path if it doesn't exist. NOTE: Provide full/absolute path.
-index INDEX Reference Index Name. Most Frequently used reference genomes index options: KPNIH1 | MRSA_USA_300 | MRSA_USA_100 | CDIFF_630 | paris Make sure the paths are properly set in config file
-steps STEPS Variant Calling Steps in sequential order.
1. All: Run all variant calling steps starting from trimming the reads, mapping, post-processing the alignments and calling variants;
2. core_All: Extract core snps and generate different types of alignments, SNP/Indel Matrices and diagnostics plots to explore filtered snps.
-analysis ANALYSIS_NAME
Unique analysis name that will be used as prefix to saving results and log files.
Optional arguments:
-config CONFIG Path to Config file, Make sure to check config settings before running pipeline
-suffix SUFFIX Fastq reads suffix such as fastq, fastq.gz, fq.gz, fq; Default: fastq.gz
-filenames FILENAMES fastq filenames with one single-end filename per line.
If the type is set to PE, it will detect the second paired-end filename with the suffix from first filename.
Useful for running variant calling pipeline on selected files in a reads directory or extracting core snps for selected samples in input reads directory.
Otherwise the pipeline will consider all the samples available in reads directory.
-cluster CLUSTER Run variant calling pipeline in one of the four modes. Default: local. The suggested mode for the core SNP step is "cluster", which will run all the steps in parallel with the available cores. Make sure to provide a large-memory node for this option.
The possible modes are: local/parallel-local/parallel-cluster/cluster
Set your specific hpc cluster parameters in config file under the [scheduler] section. Supports only PBS scheduling system.
-clean clean up intermediate files. Default: OFF
-extract_unmapped EXTRACT_UNMAPPED
Extract unmapped reads, assemble it and detect AMR genes using ariba
-datadir DATADIR Path to snpEff data directory
-snpeff_db SNPEFF_DB Name of pre-build snpEff database to use for Annotation
-debug_mode DEBUG_MODE
yes/no for debug mode
-gubbins GUBBINS yes/no for running gubbins
Variant calling can be run in parallel on each sample using the -cluster argument. Set PBS resources such as the resource list (-l), email (-M), queue (-q), flux_account (-A) and notification type (-m) under the [scheduler] section of the config file.
Refer to the UMICH Flux website for a detailed explanation of each PBS specification.
Possible options for the -cluster option (Supported system: pbs):
cluster: The pipeline is optimized for this option. When this option is set, the pipeline will run the variant call jobs for each sample as individual small cluster jobs, or as a single large cluster job for the core SNP generation step.
local: This is the default option. When this option is set, the pipeline will analyze each sample one after another. This will not make use of the multiple cores present in your system or cluster. Use this option for small numbers of samples or for testing purposes.
parallel-local: This option will run the pipeline and analyze samples in parallel, but on a local system. This is preferred for an input of fewer than 20 samples on a multi-core local system.
If you want to generate core SNPs for more than a few hundred samples, run the analysis in parallel on the cluster (time- and memory-efficient). The default PBS resources used for parallel jobs are:
nodes=1:ppn=4,pmem=4000mb,walltime=24:00:00
See the resources option in the scheduler section of the config file. Detailed information is in the section Customizing the config file.
- Run variant calling step (All) on a set of PE reads with default parameters
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps All -cluster cluster
The above command will run the variant calling (step 1) pipeline on a set of PE reads residing in test_readsdir. The results will be saved in the output directory test_output_core. The config file contains options for some frequently used reference genomes. To know which reference genomes are included in the config file, look up the config file or check the help menu of the pipeline.
The results of variant calling will be placed in an individual folder generated for each sample in the output directory. A log file for each sample will be generated and can be found in each sample folder inside the output directory. A single log file of this step will be generated in the main output directory. For more information on log file prefix and convention, please refer to the log section below.
- Generate core SNPs/Matrices from variant calling results
This step compares multiple files simultaneously and involves many I/O operations, so it is recommended to provide higher-memory compute resources.
example:
nodes=1:ppn=12,mem=47000mb,walltime=250:00:00
Replace the resources option in the scheduler section of the config file with the above line before running the command.
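A minimal sketch of that edit, assuming you are working on a local copy of the config (config_edit, as described in Customizing the config file below):
sed -i 's/^resources: .*/resources: nodes=1:ppn=12,mem=47000mb,walltime=250:00:00/' /Path-to-local/config_edit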
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps core_All -cluster cluster -gubbins yes
All the final results will be saved in a date_time_core_results directory under the output folder, for example:
2018_10_23_17_25_08_core_results
├── core_snp_consensus
├── data_matrix
└── gubbins
The subdirectories of each of the above directories are arranged in the following fashion:
core_snp_consensus The core_vcf folder under this directory contains annotated core VCF files that were used for generating the core SNP consensus fasta results. Other folders contain different combinations of core/non-core consensus fasta files for individual samples. The consensus files from these folders are concatenated to generate multiple sequence alignment files, which are then placed in the gubbins folder and used as input for Gubbins jobs.
gubbins contains different combinations of core/non-core multi-fasta alignments. Alignments with "gubbins.fa" in their extension will be used as input for Gubbins. Once the Gubbins jobs finish, the pipeline runs RAxML and IQ-TREE on the Gubbins-generated recombination-filtered consensus fasta files. The RAxML and IQ-TREE results generated from these files will be placed in the raxml_results and iqtree_results folders respectively.
The two important MSA files used for Gubbins/RAxML/IQ-TREE are genome_aln_w_ref_allele_gubbins.fa and genome_aln_w_alt_allele_unmapped_gubbins.fa:
- genome_aln_w_ref_allele_gubbins.fa: a core genome alignment relative to the reference genome, generated from core variant positions (final core variants + reference alleles).
- genome_aln_w_alt_allele_unmapped_gubbins.fa: a genome alignment containing all the variants that were called in each sample. It includes all core variant positions; non-core variant positions are substituted with a dash (denoting unmapped) or an N (for positions not meeting hard variant filter thresholds or falling under functional class filters), while non-core variants that met all filter thresholds are substituted with the variant allele.
The different types of variants and the symbols used for each variant position are described below (a toy alignment excerpt follows the list):
- If a position is unmapped in a sample, then it will be denoted by '-'
- If a position did not meet a hard filter criterion in a sample, it will be denoted by 'N' in that particular sample.
- If a position falls under any of the functional class filter such as phage region, repeat region or custom mask region, then it will be denoted by 'N'
- If a position did not meet a hard filter criterion such as FQ or MQ in any one sample, that position will be masked entirely in all the samples.
- core_var_aln: alignment made up of only core variant positions.
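As a toy illustration (hypothetical sample names and sequence), a slice of genome_aln_w_alt_allele_unmapped_gubbins.fa might look like this, where '-' marks a position unmapped in that sample and 'N' a position removed by hard or functional class filters (note the N column is shared by both samples, consistent with masking across all samples):
>Sample_A
ATGCA-TNAT
>Sample_B
ATGCAGTNAT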
data_matrix This folder contains different types of data matrices and reports that can be queried for variant diagnostics/QC plots.
- matrices/SNP_matrix_allele.csv and Indel_matrix_allele.csv: contain allele information for each unique variant position (rows) called in each individual sample (columns). The row names are the positions where a variant was called in at least one of the given samples, together with their associated annotations.
- matrices/SNP_matrix_allele_new.csv: also contains allele information for each unique variant position (rows) called in each individual sample (columns), except that unmapped and filtered positions are substituted with a dash (-) and N respectively.
- matrices/SNP_matrix_code.csv and Indel_matrix_code.csv: contain one of the eight status codes (listed below) for each unique variant position (rows) called in each individual sample (columns).
- Functional_annotation_results/phage_region_positions.txt: contains positions identified by Phaster as phage regions.
- Functional_annotation_results/repeat_region_positions.txt: contains positions identified by nucmer as tandem repeats.
- Functional_annotation_results/mask_positions.txt: contains positions set by the user to mask from the final core position list.
- Functional_annotation_results/Functional_class_filter_positions.txt: an aggregated unique list of positions that fall under repeat, mask and phage regions (REPEAT_MASK_PHAGE).
- Row annotation: each row in the matrix represents a variant position and its associated annotation, divided into two parts separated by a semicolon. An example row annotation is shown below:
Type of SNP at POS > ALT functional=PHAGE_REPEAT_MASK locus_tag=locus_id strand=strand;ALT|Effect|Impact|GeneID|Nrchange|Aachange|Nrgenepos|AAgenepos|gene_symbol|product Sample_name
Coding SNP at 31356 > T functional=NULL_NULL_NULL locus_tag=USA300HOU_0023 strand=+;T|stop_gained|HIGH|USA300HOU_0023|c.373C>T|p.Gln125*|373/2361|125/786|null or hypothetical protein|possible 5'-nucleotidase; 0
The first part contains information such as the type of SNP, position, variant allele found, functional annotation, locus tag of the gene, and strand. The second part contains the snpEff annotation of the variant and its impact on the gene.
The different status codes are:
Code | Description |
---|---|
-1 | unmapped |
0 | reference allele |
1 | core |
2 | filtered |
3 | non-core or true variant but filtered out due to another sample |
-2 | Phage Region |
-3 | FQ Region Masked |
-4 | MQ Region Masked |
Some toy examples of how codes are arranged for different types of variants, and how they would be represented in the allele matrix, are shown below:
- Reference Allele:
- Unmapped Positions:
- Core Variants:
- Filtered Positions:
- Filtered Positions masked due to Phage/FQ/MQ filters:
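For instance (a hypothetical reading of the codes above), a row of SNP_matrix_code.csv reading 1,1,1 across three samples would denote a core variant passing filters in all of them; if the same position were instead unmapped in the third sample, that sample would be coded -1 and, the position no longer being core, the true variants in the remaining samples would be coded 3. A position inside a phage region would carry -2.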
Bargraph matrices
bargraph_counts.txt and bargraph_indel_counts.txt contain the distribution of all the variant positions called in each individual sample. In the bar graphs drawn from them, each color represents a type of variant filter and the bar height corresponds to the number of variants that were filtered out due to that particular hard filter parameter in each sample.
bargraph_indel_percentage.txt and bargraph_percentage.txt contain the same distributions as the count files, but with the counts transformed into percentages.
These matrices can be used for QC checks and to understand the effects of different variant filters on the total number of core variants.
Note: Samples with an unusual bar height should be considered outlier samples.
Run the generate_diagnostics_plots.R script created inside the data_matrix folder to generate various QC plots.
Requires: ggplot2 and heatmap.3
module load R
Rscript generate_diagnostics_plots.R
File | Description |
---|---|
barplot.pdf | Distribution of filter-pass variant positions (variants observed in all the samples) in each sample. Colors represent the filter criteria that caused them to get filtered out in that particular sample. |
barplot_DP.pdf | Distribution of filter-pass variant positions in each sample. Colors represent the read-depth range that they fall in. |
temp_Only_filtered_positions_for_closely_matrix_FQ.pdf | Heatmap spanning the reference genome that shows positions that were filtered out due to low FQ values |
DP_position_analysis.pdf | same information as in barplot_DP.pdf but shown in heatmap format |
temp_Only_filtered_positions_for_closely_matrix_DP.pdf | Heatmap spanning the reference genome that shows positions that were filtered out due to low DP values |
Example plots: barplot and barplot_DP.
- Run only part of a variant calling step
OPTIONAL: If you want to rerun the above analysis with different variant call/filter parameters (i.e., skip the read cleaning, alignment and post-alignment steps), you can do so as follows:
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps varcall,filter,stats -cluster parallel-cluster
- Run the variant calling steps on selected samples provided in filenames.txt. filenames.txt should contain fastq filenames, one single-end filename per line (an illustrative example follows the command below).
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps varcall,filter,stats -cluster parallel-cluster -filenames filenames.txt
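For illustration, filenames.txt might contain the following (hypothetical filenames; with -type PE, the matching reverse-read file is detected automatically from the suffix):
sample01_R1.fastq.gz
sample02_R1.fastq.gz
sample05_R1.fastq.gz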
- Run core_prep and core steps on selected files
If you decide to exclude some samples from the core SNP analysis step (step 2), there is a way to run the core_prep and core steps on selected files without running the entire pipeline.
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps core -cluster parallel-cluster -filenames filenames_custom
By default, the pipeline uses the config file that comes with it. Either edit that config file directly, or copy it to your local system, edit the copy, and provide its path with the -config argument.
cp variant_calling_pipeline/config /Path-to-local/config_edit
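You can then point the pipeline at the edited copy with the -config argument; for example, reusing the Quick start paths:
python /nfs/esnitkin/bin_group/pipeline/Github/variant_calling_pipeline/variant_call.py -type PE -readsdir /Path-To-Your/test_readsdir/ -outdir /Path/test_output_core/ -analysis output_prefix -index MRSA_USA_300 -steps All -cluster cluster -config /Path-to-local/config_edit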
The pipeline implements customizable variant calling configurations using the config file. The config file can be customized to use your choice of aligner and variant caller by changing two parameters under the section [pipeline]. Currently, the pipeline supports the BWA aligner (mem algorithm) for aligning reads to the reference genome and SAMtools for variant calling.
# Set which tools to use in pipeline:
[pipeline]
# Options for Aligner:bwa / smalt / bowtie
aligner: bwa
# Options for variant_caller: gatkhaplotypecaller /samtools
variant_caller: samtools
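For example, to switch variant calling from SAMtools to GATK HaplotypeCaller (assuming the GATK paths are set correctly in their own section of the config file), change the variant_caller value:
[pipeline]
aligner: bwa
variant_caller: gatkhaplotypecaller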
Make sure you have downloaded all the dependencies for the pipeline in a folder path provided in the binbase option of the [bin_path] section.
# Set bin folder path. Please make sure all the executables are placed in the bin folder. Also make sure the path for individual tools are correct.
[bin_path]
binbase: /nfs/esnitkin/bin_group/variant_calling_bin/
NOTE: Add the required Perl libraries (such as in the case of vcftools) to the PERL5LIB environment variable. Flux users can do that by loading perl-modules:
module load perl-modules
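On systems without the perl-modules module, the same effect can be achieved by exporting the variable directly (hypothetical library path):
export PERL5LIB=/Path-to-local/perl5/lib:$PERL5LIB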
If you wish to run the jobs on a cluster, make sure you change the necessary scheduler parameters in the scheduler section shown below. For more information, visit the Flux homepage.
[scheduler]
resources: nodes=1:ppn=4,pmem=4000mb,walltime=24:00:00
email: username@umich.edu
queue: XXX
flux_account: XXX
notification: a
Every tool has its own *_bin option where you can set the name of the folder in which the tool resides. For example, in the Trimmomatic section below, the Trimmomatic tool resides in the /Trimmomatic/ folder (set with the trimmomatic_bin option), which in turn resides in the /nfs/esnitkin/bin_group/variant_calling_bin/ folder set by the binbase option above.
[Trimmomatic]
trimmomatic_bin: /Trimmomatic/
adaptor_filepath: adapters/TruSeq3-Nextera_PE_combined.fa
seed_mismatches: 2
palindrome_clipthreshold: 30
simple_clipthreshold: 10
minadapterlength: 8
keep_both_reads: true
window_size: 4
window_size_quality: 20
minlength: 40
headcrop_length: 0
colon: :
targetlength: 125
crop_length: 40
f_p: forward_paired.fq.gz
f_up: forward_unpaired.fq.gz
r_p: reverse_paired.fq.gz
r_up: reverse_unpaired.fq.gz
Parameters for each of the tools can be customized under the 'tool_parameter' attribute of each tool in the config file. For example, to change the minadapterlength parameter of Trimmomatic from 8 to 10, edit the value in the config file and restart the pipeline.
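A quick way to make such an edit on a local copy of the config (a sketch assuming the config_edit copy from above):
sed -i 's/^minadapterlength: 8/minadapterlength: 10/' /Path-to-local/config_edit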
The pipeline generates a log file following the naming convention yyyy_mm_dd_hrs_mins_secs_analysisname.log.txt and tracks each event/command. The log entries follow standard Python logging conventions:
INFO to print STDOUT messages;
DEBUG to print commands run by the pipeline;
ERROR to print STDERR messages; and
EXCEPTION to print exceptions that occurred while the pipeline was running.
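For example, to pull out just the commands that the pipeline ran, grep the DEBUG lines from the log (file name assembled from the convention above, using the example analysis name output_prefix):
grep "DEBUG" /Path/test_output_core/yyyy_mm_dd_hrs_mins_secs_output_prefix.log.txt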