Analysis Pipeline

Input: A user query in the form of a country, lineage, and drug, OR a range of countries, lineages, and drugs to run the pipeline on.
Output: A SNP alignment file, .tree files from BEAST (with node ages) for the respective input, and a bootstrap consensus tree.

Running The Pipeline

python run_pipeline.py 

To specify your query, open this Python script and do the following (see the example after this list):

  • In the countries variable, put a list of the countries that you want (a full list of supported countries is already included)
  • In the lineages variable, put a list of the lineages that you want (a full list of supported lineages is already included)
  • In the drugs variable, put a list of the drugs that you want (a full list of supported drugs is already included)
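
For example, the edited variables might look like this. The variable names come from run_pipeline.py; the values below are purely illustrative, and the exact spelling of supported countries, lineages, and drugs should be taken from the lists already in the script:

# Inside run_pipeline.py -- a hypothetical example query
countries = ["Peru", "South Africa"]      # illustrative values only
lineages = ["lineage2", "lineage4"]       # use the spellings from the supported-lineage list
drugs = ["isoniazid", "rifampicin"]       # use the spellings from the supported-drug list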

Then, when you run the script, it creates a separate sbatch call for each country-lineage-drug combination, as sketched below. Specifically, it puts the actual script sent to the cluster in cluster_scripts and the log files in logs. Once a script finishes executing, you can find its final output in the output folder. All files associated with a specific country-lineage-drug combination are named countrylineagedrug.
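
A minimal sketch of what this submission loop amounts to, assuming hypothetical file names (the real loop lives in run_pipeline.py and may differ in detail):

import itertools
import subprocess

# countries, lineages, drugs: the lists set at the top of the script (see above)
for country, lineage, drug in itertools.product(countries, lineages, drugs):
    combo = f"{country}{lineage}{drug}"        # the countrylineagedrug naming scheme
    script = f"cluster_scripts/{combo}.sh"     # hypothetical path for the generated script
    subprocess.run([
        "sbatch",
        "--output", f"logs/{combo}.out",       # stdout log
        "--error", f"logs/{combo}.err",        # stderr log
        script,
    ])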

NOTE: You can tell that a script has finished by the location of its countrylineagedrug folder. While the script is running, the countrylineagedrug folder sits in the analysis_pipeline folder; once the script has finished, the folder moves to the output folder. You can also look at the .out/.err files in the logs folder to check the status of any of your launched scripts.

Pipeline Flow

(Pipeline flow diagram: see the illustration in the repository.)

Folders Included In The Pipeline

The analysis_pipeline folder comes with various folders inside. PLEASE do not move any of these folders, as the runner relies on their locations. Here are the folders you should see:

  • bin: Pipeline critical scripts (A detailed explanation of the scripts will follow)
  • logs: The log files of the jobs you have launched via run_pipeline.py
  • output: The output of the jobs you have launched once finished
  • cluster_scripts: Scripts submitted to the cluster per your query

Scripts in Analysis_Pipeline

Other than run_pipeline.py, you should find analysis_pipeline.sh. This file is called by run_pipeline.py once for each country-lineage-drug combination and contains the meat of the pipeline, as seen in the pipeline flow illustration. What follows is a detailed explanation of what each script in the pipeline does.

Filters

To make sure our results are accurate, we included various filters in run_pipeline.py to screen out potentially bogus country-lineage-drug combinations. They are labeled in run_pipeline.py so they can be changed, but as of now they are (see the sketch after this list):

  1. Make sure there are at least 10 strains that fit the user query
  2. Make sure at least 20% of those strains are resistant and at least 20% are susceptible
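
A minimal sketch of these two checks, assuming a hypothetical resistance lookup (the actual filter code in run_pipeline.py may be structured differently):

def passes_filters(strains, resistance):
    # strains: list of strain IDs matching the query
    # resistance: hypothetical dict mapping strain ID -> 1 (resistant) or 0 (susceptible)
    if len(strains) < 10:                 # filter 1: at least 10 matching strains
        return False
    frac_resistant = sum(resistance[s] for s in strains) / len(strains)
    return 0.2 <= frac_resistant <= 0.8   # filter 2: >= 20% resistant AND >= 20% susceptible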

Producing SNP Alignment

This part of the pipeline takes the query data and outputs a FASTA file of the SNPs for the user query, to be used in downstream analysis.

  • accessdatabase.py: Takes in the strain_info.tsv file, which contains all the info on all strains (generated by another script, geography.py), along with the query, and outputs a list of strains that match the query, or reports that no strains exist for the given query
python accessdatabase.py <geography.py output> <country> <lineage> <drug> 
  • get_metadata.py: Takes in the strains.txt file generated by accessdatabase.py and the strain_info.tsv file, and outputs a tab-delimited file where one column is the strain and the other is 1 or 0 depending on whether the strain is resistant or susceptible (respectively) to the drug of interest. This is useful info for downstream analysis.
python get_metadata.py 
  • yasha_depth_organizer.py: Takes in the strains.txt file, goes into the rollingDB/genomic_data/ directory, accesses all the depth files of interest, and concatenates them into one larger depth file (low_coverage.bed) based on your input tolerances
python yasha_depth_organizer.py <file with strains of interest> <Coverage Threshold> <Sequence Threshold> <Reference Genome Name> 
  • fix_bed.py: Takes in low_coverage.bed and a summary text file full of strand information and creates a low_coverage.bed file with strand information
python fix_bed.py <bed file> <summary file> 
  • The commands after this script simply combine the output of fix_bed.py with another BED file full of DNA regions that we do not want to look at, and then organize all of this into a valid BED file (a sketch of the merge idea follows below)
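
A minimal Python sketch of that combine-and-sort idea, assuming simple three-column BED lines (the pipeline itself does this step with shell commands, and the real files carry extra columns such as strand):

def merge_bed(paths):
    # Read chrom/start/end from each BED file, then sort and merge overlapping regions.
    intervals = []
    for path in paths:
        for line in open(path):
            fields = line.split()
            if len(fields) < 3:
                continue                          # skip blank or malformed lines
            intervals.append((fields[0], int(fields[1]), int(fields[2])))
    intervals.sort()
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)  # overlapping: extend the last region
        else:
            merged.append([chrom, start, end])
    return merged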

Applying Diversity Filter

In addition to the filters mentioned above, there is a genomic diversity filter. The genomic diversity of a group of strains is here defined as the average pairwise difference between all strains in the SNP alignment. Using another script, I approximated the distribution of genomic diversity, calculating a mean and standard deviation. This part of the pipeline makes sure that the genomic diversity of the country-lineage-drug combination is within 1 SD of the mean genomic diversity before moving on.

  • distribution.py: Takes in a SNP alignment and calculates the genomic diversity (see the sketch below)
python distribution.py <path to SNP alignment> 
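
A minimal sketch of the statistic itself, i.e. the average pairwise SNP difference; the actual implementation in distribution.py may differ, and the 1-SD gate uses the mean and standard deviation estimated beforehand:

from itertools import combinations

def genomic_diversity(seqs):
    # seqs: equal-length SNP-alignment sequences, one string per strain (needs >= 2 strains)
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs]
    return sum(diffs) / len(pairs)

# The filter then checks (mean and std precomputed from the approximated distribution):
# abs(genomic_diversity(seqs) - mean) <= std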

Prepping BEAST Input File

This part of the pipeline takes the FASTA file output by the previous part and creates the input file to be used by BEAST.

  • RaxML is used to find a starting tree, with the reference strain as the outgroup, to put in the starting-tree part of our XML file
  • Megacc is used to test the relaxed- and strict-clock hypotheses to figure out which clock to use
  • fasta_to_beast.py: Takes in the concatenate.fasta file, the tree output by RaxML, and a template file (two different ones, depending on which clock hypothesis you go with) and outputs an input file to use with BEAST (see the sketch below)
python fasta_to_beast.py <SNP Alignment File> <RaxML tree> <Template File>  
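
A minimal sketch of the template-filling idea, assuming hypothetical placeholder tags (ALIGNMENT_BLOCK, STARTING_TREE) in the template file; the real templates and XML details used by fasta_to_beast.py will differ:

import sys

fasta_path, tree_path, template_path = sys.argv[1:4]
xml = open(template_path).read()

# Turn FASTA records into BEAST-style <sequence> entries
# (assumes one sequence line per record, as in a SNP alignment).
sequences = []
name = None
for line in open(fasta_path):
    line = line.strip()
    if line.startswith(">"):
        name = line[1:]
    elif name:
        sequences.append(f'<sequence taxon="{name}" value="{line}"/>')

xml = xml.replace("ALIGNMENT_BLOCK", "\n".join(sequences))           # hypothetical tag
xml = xml.replace("STARTING_TREE", open(tree_path).read().strip())   # hypothetical tag
sys.stdout.write(xml)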

Running BEAST

This part of the pipeline runs BEAST. We basically run BEAST repeatedly, with different iteration counts, until the ESS values become greater than 200 (this loop is sketched below).

  • beast_parameter_tester.py: Takes in a list of ESS values and outputs True if all ESS values are greater than 200; otherwise it outputs the number by which to divide 200 to get the factor to multiply the iterations by, to hopefully reach a more optimal ESS value
python beast_parameter_tester.py <Path to input.xml file of BEAST> <Path to file with ESS values>  
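
A minimal sketch of the retry logic, assuming the non-True output is the lowest ESS value (the actual loop is driven from the pipeline shell script):

TARGET_ESS = 200

def next_chain_length(chain_length, ess_values):
    # Return None if the run has converged, else a scaled-up chain length.
    worst = min(ess_values)
    if worst > TARGET_ESS:
        return None                      # all ESS values pass: keep this run
    factor = TARGET_ESS / worst          # e.g. worst ESS of 50 -> multiply iterations by 4
    return int(chain_length * factor)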

Interpreting BEAST Results

In this part, we take the tree output from BEAST and run a bootstrap analysis on it. We use TreeAnnotator to get the final tree from BEAST and RaxML to run the bootstrap analysis.
