This repository contains code for proteome curation, phylogenomic inference, molecular conservation calculations, and analyses related to the pub "Leveraging evolution to identify novel organismal models of human biology".
This repository uses conda to manage software environments and package installation. You can find operating system-specific instructions for installing miniconda here.
After installing conda and mamba, you can build the environment. Because the conservation analysis depends on several R packages that are not distributed through conda, as well as several packages that must be compiled locally from source, you must take two additional steps. First, before building the environment, edit the environment YAML file and uncomment the C/C++ compilers appropriate for your operating system. The relevant section of the environment YAML file is shown below. Currently, a Unix-like environment is assumed, with the Linux-specific compilers uncommented by default. If you are running on macOS, you'll need to comment out the GCC compilers and uncomment those for Clang.
dependencies: # Comment and uncomment the relevant lines below based on your operating system.
- gcc_linux-64 # Linux (GCC C compiler)
- gxx_linux-64 # Linux (GCC C++ compiler)
# - clang_osx-64 # macOS (Clang C compiler)
# - clangxx_osx-64 # macOS (Clang C++ compiler)
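If you are unsure which pair of compiler lines applies to your machine, the following is a minimal sketch (not part of this repository) that maps the operating system reported by Python to the package names listed in the YAML excerpt above. The function name `compilers_for` is hypothetical, and only the two platforms named in the YAML file are covered.

```python
import platform

# Compiler packages as they appear in the environment YAML excerpt above.
COMPILERS = {
    "Linux": ["gcc_linux-64", "gxx_linux-64"],
    "Darwin": ["clang_osx-64", "clangxx_osx-64"],  # macOS reports "Darwin"
}


def compilers_for(system=None):
    """Return the compiler packages to leave uncommented for an operating system.

    `system` defaults to platform.system(); this helper is hypothetical
    and only illustrates the choice described in the text.
    """
    system = system or platform.system()
    if system not in COMPILERS:
        raise ValueError(f"No compilers listed in the YAML file for {system!r}")
    return COMPILERS[system]
```

Calling `compilers_for()` with no arguments reports the lines that should stay uncommented on the current machine; all other compiler lines should be commented out.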
Second, you must run additional build scripts after creating and activating the new conda environment with the appropriate compilers installed. Below, we provide code to carry out the whole process (after modifying the environment YAML file).
# Create the environment and activate it (after first editing the environment YAML file).
mamba env create -n aa_stats_mv_dists --file envs/aa_stats_mv_dists.yml
conda activate aa_stats_mv_dists
# Install the remaining dependencies within this conda environment:
bash install/install_pathd8.sh
bash install/install_treepl.sh
bash install/install_r_packages_for_aa_stats_mv_dists.sh
Before proceeding with any (re)analysis, first download the NovelTree run outputs from Zenodo here and decompress them:
# Download all data and results from Zenodo (note: this file is 13GB).
wget https://zenodo.org/records/14425432/files/2024-organismal-selection-zenodo.zip
# Extract these data:
unzip 2024-organismal-selection-zenodo.zip
# Navigate into the directory and extract the NovelTree run outputs for reanalysis:
cd 2024-organismal-selection-zenodo/
tar -xzvf results-noveltree-model-euks.tar.gz
The data hosted on Zenodo include a directory (2024-organismal-selection-zenodo/) containing the following:
- run_configurations/noveltree-model-euks-samplesheet.csv - the samplesheet for our snakemake preprocessing workflow to filter and preprocess species proteomes prior to analysis with NovelTree.
- run_configurations/euk_preprocess_samplesheet.tsv & run_configurations/noveltree-model-euks-parameterfile.json - the NovelTree sample and parameter files used to run NovelTree.
- preprocessed_proteomes.tar.gz - a compressed tarball containing the preprocessed proteomes used by our NovelTree run.
- results-noveltree-model-euks.tar.gz - a compressed tarball containing all outputs generated by our NovelTree run.
- aa-summary-stats.tar.gz - a compressed tarball containing all AA summary statistics generated by code/genefam_aa_summaries.py.
- gf-aa-multivar-distances.tar.gz - a compressed tarball containing all result files produced by code/calc_protein_mv_distances.R.
- organismal_selection_tool_citations.csv - source citations describing available genetic perturbations for organisms in our portfolio.
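As a quick sanity check after unzipping, a short script along these lines could confirm that every file listed above is present. This is a minimal sketch, not part of the repository: the function name `missing_entries` is hypothetical, and the relative paths are copied from the list above.

```python
from pathlib import Path

# Relative paths taken from the list of Zenodo contents above.
EXPECTED = [
    "run_configurations/noveltree-model-euks-samplesheet.csv",
    "run_configurations/euk_preprocess_samplesheet.tsv",
    "run_configurations/noveltree-model-euks-parameterfile.json",
    "preprocessed_proteomes.tar.gz",
    "results-noveltree-model-euks.tar.gz",
    "aa-summary-stats.tar.gz",
    "gf-aa-multivar-distances.tar.gz",
    "organismal_selection_tool_citations.csv",
]


def missing_entries(base_dir):
    """Return the expected relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED if not (base / rel).exists()]
```

For example, `missing_entries("2024-organismal-selection-zenodo")` returns an empty list when the archive was downloaded and extracted completely.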
With the NovelTree run outputs downloaded and extracted into the base directory of this repository, we now proceed by calling the script code/genefam_aa_summaries.py. For each protein sequence within each gene family, this Python script calculates summaries of AA composition as well as AA physical properties. All code below assumes that you have downloaded and extracted the directory 2024-organismal-selection-zenodo/ from this pub's corresponding Zenodo repository.
# Ensure we are calling this script within the correct conda environment
conda activate aa_stats_mv_dists
# Store the MSA directory in a variable
msa_dir="2024-organismal-selection-zenodo/results-noveltree-model-euks/witch_alignments/original_alignments/"
# Now, run the script to calculate the physicochemical properties of each protein using ProtParam
python code/genefam_aa_summaries.py -t 10 "$msa_dir"
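To illustrate the kind of per-protein summary involved, here is a minimal pure-Python sketch that strips alignment gaps and computes AA composition plus the mean Kyte-Doolittle hydropathy (GRAVY, one of the properties ProtParam reports). It is not the implementation in code/genefam_aa_summaries.py; the function names `read_fasta` and `summarize` are hypothetical, and the real script reports more properties.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def read_fasta(text):
    """Parse FASTA-formatted text into {sequence id: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def summarize(seq):
    """Summarize one (possibly aligned) protein sequence.

    Gap characters and non-standard residue codes are dropped first.
    """
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    counts = Counter(residues)
    return {
        "length": n,
        "composition": {aa: counts[aa] / n for aa in sorted(counts)},
        "gravy": sum(KD[aa] for aa in residues) / n,
    }
```

Applying `summarize` to each sequence returned by `read_fasta` for one alignment file gives a per-protein table for that gene family. The script invoked above computes its summaries across every alignment in the MSA directory.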
Running code/genefam_aa_summaries.py creates a new directory called "aa-summary-stats/" that contains the calculated AA properties for each protein, as well as summaries for each gene family. With these protein properties curated, we can now proceed with the calculation of pairwise multivariate distances between proteins within each gene family.
Rscript code/calc_protein_mv_distances.R
Briefly, this script:
- Reads in the species tree from the NovelTree run results and time-calibrates it using a species tree containing these species obtained from timetree.org.
- Reads in species metadata from the NovelTree samplesheet, along with copy number information.
- Reads in the gene family trees and the protein properties calculated by code/genefam_aa_summaries.py, retaining only those gene families that contain human proteins. Then, for each gene family, it:
  - Time-calibrates the gene family tree so that branch lengths reflect time rather than the extent of sequence divergence.
  - Uses this tree to transform the AA physical properties, correcting for phylogenetic non-independence between proteins.
  - Calculates multivariate (Mahalanobis) distances between proteins.
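The final step can be illustrated with a minimal pure-Python sketch of the Mahalanobis distance between two property vectors, d(x, y) = sqrt((x - y)^T S^-1 (x - y)), where S is a covariance matrix of the properties. This is only an illustration under simplifying assumptions: the actual analysis is done in R by code/calc_protein_mv_distances.R, the phylogenetic transformation is omitted here, and the helper names (`covariance`, `solve`, `mahalanobis`) are hypothetical.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length property vectors."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(x, y, cov):
    """Mahalanobis distance between vectors x and y given covariance cov."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)  # z = cov^-1 @ diff, without forming the inverse
    return math.sqrt(sum(d * w for d, w in zip(diff, z)))
```

With an identity covariance matrix the result reduces to the ordinary Euclidean distance. Unlike Euclidean distance, the general form down-weights differences along directions in which the properties vary widely or are strongly correlated.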
To reproduce the analyses in the pub, create the analysis conda environment and install the remaining R packages:
mamba env create -n organismal-selection-analysis --file envs/analysis.yml
conda activate organismal-selection-analysis
Rscript install/install_r_packages_for_analysis.R
Next, load and organize the data:
Rscript code/org-sel-data.R
The code to recreate the analyses and figures from the pub is in the script code/org-sel-analysis.R.
See how we recognize feedback and contributions to our code.