Analysis Pipeline

Input: A user query in the form of a country, lineage, and drug, OR a range of countries, lineages, and drugs to run the pipeline on.
Output: A SNP alignment file, .tree files from BEAST (with node ages) for the respective input, and a bootstrap consensus tree.

Running The Pipeline

python run_pipeline.py 

To specify your query, open this Python script and do the following (see the example after this list):

  • In the countries variable, put a list of the countries that you want (a full list of supported countries is already included)
  • In the lineages variable, put a list of the lineages that you want (a full list of supported lineages is already included)
  • In the drugs variable, put a list of the drugs that you want (a full list of supported drugs is already included)
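
For example, the edited variables might look like this. The variable names come from run_pipeline.py; the values below are purely illustrative, and the exact spelling of supported countries, lineages, and drugs should be taken from the lists already in the script:

# Inside run_pipeline.py -- a hypothetical example query
countries = ["Peru", "South Africa"]      # illustrative values only
lineages = ["lineage2", "lineage4"]       # use the spellings from the supported-lineage list
drugs = ["isoniazid", "rifampicin"]       # use the spellings from the supported-drug list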

Then, when you run the script, it creates a separate sbatch call for each country-lineage-drug combination, as sketched below. Specifically, it puts the actual script sent to the cluster in cluster_scripts and the log files in logs. Once a script finishes executing, you can find its final output in the output folder. All files associated with a specific country-lineage-drug combination are named countrylineagedrug.
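
A minimal sketch of what this submission loop amounts to, assuming hypothetical file names (the real loop lives in run_pipeline.py and may differ in detail):

import itertools
import subprocess

# countries, lineages, drugs: the lists set at the top of the script (see above)
for country, lineage, drug in itertools.product(countries, lineages, drugs):
    combo = f"{country}{lineage}{drug}"        # the countrylineagedrug naming scheme
    script = f"cluster_scripts/{combo}.sh"     # hypothetical path for the generated script
    subprocess.run([
        "sbatch",
        "--output", f"logs/{combo}.out",       # stdout log
        "--error", f"logs/{combo}.err",        # stderr log
        script,
    ])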

NOTE: You can tell that a script has finished by the location of its countrylineagedrug folder. While the script is running, the countrylineagedrug folder sits in the analysis_pipeline folder; once the script has finished, the folder moves to the output folder. You can also look at the .out/.err files in the logs folder to check the status of any of your launched scripts.

Pipeline Flow

(Pipeline flow diagram: see the illustration in the repository.)

Folders Included In The Pipeline

The analysis_pipeline folder comes with various folders inside. PLEASE do not move any of these folders, as the runner relies on their locations. Here are the folders you should see:

  • bin: Pipeline critical scripts (A detailed explanation of the scripts will follow)
  • logs: The log files of the jobs you have launched via run_pipeline.py
  • output: The output of the jobs you have launched once finished
  • cluster_scripts: Scripts submitted to the cluster per your query

Scripts in Analysis_Pipeline

Other than run_pipeline.py, you should find analysis_pipeline.sh. This file is called by run_pipeline.py once for each country-lineage-drug combination and contains the meat of the pipeline, as seen in the pipeline flow illustration. What follows is a detailed explanation of what each script in the pipeline does.

Filters

To make sure our results are accurate, we included various filters in run_pipeline.py to screen out potentially bogus country-lineage-drug combinations. They are labeled in run_pipeline.py so they can be changed, but as of now they are (see the sketch after this list):

  1. Make sure there are at least 10 strains that fit the user query
  2. Make sure at least 20% of those strains are resistant and at least 20% are susceptible
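
A minimal sketch of these two checks, assuming a hypothetical resistance lookup (the actual filter code in run_pipeline.py may be structured differently):

def passes_filters(strains, resistance):
    # strains: list of strain IDs matching the query
    # resistance: hypothetical dict mapping strain ID -> 1 (resistant) or 0 (susceptible)
    if len(strains) < 10:                 # filter 1: at least 10 matching strains
        return False
    frac_resistant = sum(resistance[s] for s in strains) / len(strains)
    return 0.2 <= frac_resistant <= 0.8   # filter 2: >= 20% resistant AND >= 20% susceptible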

Producing SNP Alignment

This part of the pipeline takes the query data and outputs a FASTA file of the SNPs for the user query, to be used in downstream analysis.

  • accessdatabase.py: Takes in the strain_info.tsv file, which contains all the info on all strains (generated by another script, geography.py), along with the query, and outputs a list of strains that match the query, or reports that no strains exist for the given query
python accessdatabase.py <geography.py output> <country> <lineage> <drug> 
  • get_metadata.py: Takes in the strains.txt file generated by accessdatabase.py and the strain_info.tsv file, and outputs a tab-delimited file where one column is the strain and the other is 1 or 0 depending on whether the strain is resistant or susceptible (respectively) to the drug of interest. This is useful info for downstream analysis.
python get_metadata.py 
  • yasha_depth_organizer.py: Takes in the strains.txt file, goes into the rollingDB/genomic_data/ directory, accesses all the depth files of interest, and concatenates them into one larger depth file (low_coverage.bed) based on your input tolerances
python yasha_depth_organizer.py <file with strains of interest> <Coverage Threshold> <Sequence Threshold> <Reference Genome Name> 
  • fix_bed.py: Takes in low_coverage.bed and a summary text file full of strand information and creates a low_coverage.bed file with strand information
python fix_bed.py <bed file> <summary file> 
  • The commands after this script simply combine the output of fix_bed.py with another BED file full of DNA regions that we do not want to look at, and then organize all of this into a valid BED file (a sketch of the merge idea follows below)
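
A minimal Python sketch of that combine-and-sort idea, assuming simple three-column BED lines (the pipeline itself does this step with shell commands, and the real files carry extra columns such as strand):

def merge_bed(paths):
    # Read chrom/start/end from each BED file, then sort and merge overlapping regions.
    intervals = []
    for path in paths:
        for line in open(path):
            fields = line.split()
            if len(fields) < 3:
                continue                          # skip blank or malformed lines
            intervals.append((fields[0], int(fields[1]), int(fields[2])))
    intervals.sort()
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)  # overlapping: extend the last region
        else:
            merged.append([chrom, start, end])
    return merged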

Applying Diversity Filter

In addition to the filters mentioned above, there is a genomic diversity filter. The genomic diversity of a group of strains is here defined as the average pairwise difference between all strains in the SNP alignment. Using another script, I approximated the distribution of genomic diversity, calculating a mean and standard deviation. This part of the pipeline makes sure that the genomic diversity of the country-lineage-drug combination is within 1 SD of the mean genomic diversity before moving on.

  • distribution.py: Takes in a SNP alignment and calculates the genomic diversity (see the sketch below)
python distribution.py <path to SNP alignment> 
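
A minimal sketch of the statistic itself, i.e. the average pairwise SNP difference; the actual implementation in distribution.py may differ, and the 1-SD gate uses the mean and standard deviation estimated beforehand:

from itertools import combinations

def genomic_diversity(seqs):
    # seqs: equal-length SNP-alignment sequences, one string per strain (needs >= 2 strains)
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs]
    return sum(diffs) / len(pairs)

# The filter then checks (mean and std precomputed from the approximated distribution):
# abs(genomic_diversity(seqs) - mean) <= std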

Prepping BEAST Input File

This part of the pipeline takes the FASTA file output by the previous part and creates the input file to be used by BEAST.

  • RaxML is used to find a starting tree, with the reference strain as the outgroup, to put in the starting-tree part of our XML file
  • Megacc is used to test the relaxed- and strict-clock hypotheses to figure out which clock to use
  • fasta_to_beast.py: Takes in the concatenate.fasta file, the tree output by RaxML, and a template file (two different ones, depending on which clock hypothesis you go with) and outputs an input file to use with BEAST (see the sketch below)
python fasta_to_beast.py <SNP Alignment File> <RaxML tree> <Template File>  
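
A minimal sketch of the template-filling idea, assuming hypothetical placeholder tags (ALIGNMENT_BLOCK, STARTING_TREE) in the template file; the real templates and XML details used by fasta_to_beast.py will differ:

import sys

fasta_path, tree_path, template_path = sys.argv[1:4]
xml = open(template_path).read()

# Turn FASTA records into BEAST-style <sequence> entries
# (assumes one sequence line per record, as in a SNP alignment).
sequences = []
name = None
for line in open(fasta_path):
    line = line.strip()
    if line.startswith(">"):
        name = line[1:]
    elif name:
        sequences.append(f'<sequence taxon="{name}" value="{line}"/>')

xml = xml.replace("ALIGNMENT_BLOCK", "\n".join(sequences))           # hypothetical tag
xml = xml.replace("STARTING_TREE", open(tree_path).read().strip())   # hypothetical tag
sys.stdout.write(xml)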

Running BEAST

This part of the pipeline runs BEAST. We basically run BEAST repeatedly, with different iteration counts, until the ESS values become greater than 200 (this loop is sketched below).

  • beast_parameter_tester.py: Takes in a list of ESS values and outputs True if all ESS values are greater than 200; otherwise it outputs the number by which to divide 200 to get the factor to multiply the iterations by, to hopefully reach a more optimal ESS value
python beast_parameter_tester.py <Path to input.xml file of BEAST> <Path to file with ESS values>  
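
A minimal sketch of the retry logic, assuming the non-True output is the lowest ESS value (the actual loop is driven from the pipeline shell script):

TARGET_ESS = 200

def next_chain_length(chain_length, ess_values):
    # Return None if the run has converged, else a scaled-up chain length.
    worst = min(ess_values)
    if worst > TARGET_ESS:
        return None                      # all ESS values pass: keep this run
    factor = TARGET_ESS / worst          # e.g. worst ESS of 50 -> multiply iterations by 4
    return int(chain_length * factor)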

Interpreting BEAST Results

In this part, we take the tree output from BEAST and run a bootstrap analysis on it. We use TreeAnnotator to get the final tree from BEAST and RaxML to run the bootstrap analysis.
