Skip to content

Latest commit

 

History

History
171 lines (140 loc) · 4.73 KB

README.md

File metadata and controls

171 lines (140 loc) · 4.73 KB

GenTB-RandomForest tuberculosis resistance prediction pipeline in Snakemake

This repository contains the analysis pipeline used by GenTB to predict antibiotic resistance M. tuberculosis wrapped in Snakemake. The code for the GenTB webiste is described here.

Overview

The pipeline uses well established programs to align reads to a reference strain and call variants. Raw sequence reads, in the form of a fastq file, are aligned to H37Rv reference strain using minimap2. Samtools is employed for sorting of the alignment file and removal of duplicate reads. Variants are called with Pilon. The variants are then annotated and filtered using custom scripts and fed to two multivariate prediction models (GenTB-Random Forest and GenTB Wide and Deep Neural Network).

Note

This Snakemake implementation of GenTB-Random Forest currently does not include the Kraken and Check Depth steps.

Usage

  1. Clone this Github repository to your local machine
git clone https://github.com/farhat-lab/gentb-snakemake.git
  1. Download example fastq files:
#change directory into the repository folder, create a directory to save the fastq files, then download sample fastq files
cd gentb-snakemake/; mkdir data/fastq; cd data/fastq/

#download one set of paired-ended fastQ files
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR187/009/ERR1873539/ERR1873539_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR187/009/ERR1873539/ERR1873539_2.fastq.gz

#change back to main `gentb-snakemake` directory
cd ../../
  1. Make sure you have the package manager conda installed to source all software and dependencies into a new environment:
conda env create -f envs/snakemake-gentb.yml
conda activate snakemake-gentb
  1. Start the analysis pipeline in Snakemake
snakemake
  1. Start the analysis pipeline in Snakemake on a SLURM cluster:
snakemake -s Snakefile -j 100 --cluster-config cluster.json --cluster "sbatch --mem {cluster.mem} -t {cluster.t} -c {cluster.c} -p {cluster.p}"

Output description

The final output file <sampleID>.predict.json is printed in JSON format and can be parsed using R or python functions. The output consists of three parts.

  1. Drug resistance probabilities

A list of drug-lists where each drug-list has the following items:

["<sampleID>", 
 "<drug>", 
 "<binary resistance (1/0)>", 
 "<False Positive Rate>", 
 "<False Negative Rate>", 
 "<probability of resistance>"]

For the example of ERR1873539 and the drug ethambutol, this part of the output would look as follows:

["ERR1873539", 
 "emb", 
 "1", 
 "21.28", 
 "16.64", 
 "0.998"]
  1. Important drug resistance mutations

Next, a list of five lists of important drug resistance mutations (defined here). The items per list are ordered the same like the resistance probability lists above.

For the example of ERR1873539, this part of the output would look as follows:

"ERR1873539": [
[
"SNP_CN_2155168_C944G_S315T_katG", 
"SNP_CN_761110_A1304T_D435V_rpoB", 
"INS_CF_2288725_i517G_173G_pncA", 
"SNP_CN_4247431_G918A_M306I_embB", 
"SNP_N_1472359_A514C_rrs", 
"SNP_CN_4326333_C1141G_A381P_ethA", 
"SNP_N_1473246_A1401G_rrs", 
"SNP_N_1473246_A1401G_rrs", 
"SNP_N_1473246_A1401G_rrs", 
"SNP_CN_7582_A281G_D94G_gyrA", 
"SNP_CN_7582_A281G_D94G_gyrA", 
"SNP_CN_7582_A281G_D94G_gyrA", 
null
], 
[
"SNP_CN_4247431_G918A_M306I_embB", 
null, 
null, 
null, 
"SNP_N_1473246_A1401G_rrs", 
"SNP_P_1673423_G17T_promoter_fabG1.inhA", 
null, 
null, 
"SNP_N_1472359_A514C_rrs", 
null, 
null, 
null, 
null
], 
[
"SNP_P_1673423_G17T_promoter_fabG1.inhA", 
null, 
null, 
null, 
"SNP_CN_4407927_T276G_E92D_gid", 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null
], 
[
null, 
null, 
null, 
null, 
"SNP_CN_4407967_A236G_L79S_gid", 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null
], 
[
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null, 
null
]
]

  1. Last, another list of five list follows with other or novel variants that were not seen durinig training the Random Forest model but occurred in known resistance genes.

Tested on CentOS Linux Version 7