2024-organismal-selection

Purpose

This repository contains code for proteome curation, phylogenomic inference, molecular conservation calculations, and analyses related to the pub "Leveraging evolution to identify novel organismal models of human biology".

Installation and Setup

This repository uses conda to manage software environments and package installation. You can find operating system-specific instructions for installing miniconda here.

After installing conda and mamba, you can build the environment. Because the conservation analysis depends on several R packages that are not distributed through conda, as well as several packages that must be compiled locally from source, you must take two additional steps. First, before building the environment, edit the environment YAML file (envs/aa_stats_mv_dists.yml) and uncomment the C/C++ compilers that are appropriate for your operating system. The relevant section of the YAML file is shown below. A Unix-like environment is assumed, with the Linux-specific compilers uncommented by default. If you are running macOS, comment out the GCC compilers and uncomment the clang compilers.

dependencies: # Comment and uncomment the relevant lines below based on your operating system.
  - gcc_linux-64  # Linux (GCC C compiler)
  - gxx_linux-64  # Linux (GCC C++ compiler)
  # - clang_osx-64  # macOS (Clang C compiler)
  # - clangxx_osx-64  # macOS (Clang C++ compiler)
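
The choice between the two compiler pairs depends only on the operating system. As a minimal sketch (not part of this repository), the following Python helper reports which packages from the excerpt above should be left uncommented. The function name `compilers_for` and the lookup table are hypothetical and exist only for illustration; the package names are copied from the YAML excerpt.

```python
import platform

# Compiler packages named in the environment YAML excerpt above.
# platform.system() reports macOS as "Darwin".
COMPILERS_BY_OS = {
    "Linux": ["gcc_linux-64", "gxx_linux-64"],
    "Darwin": ["clang_osx-64", "clangxx_osx-64"],
}


def compilers_for(system_name=None):
    """Return the conda compiler packages to leave uncommented for an OS.

    Hypothetical helper: if no name is given, the current OS is detected.
    """
    if system_name is None:
        system_name = platform.system()
    if system_name not in COMPILERS_BY_OS:
        raise ValueError(f"No compiler packages listed for {system_name!r}")
    return COMPILERS_BY_OS[system_name]
```

Calling `compilers_for()` with no argument uses the operating system of the machine it runs on; any system other than Linux or macOS raises an error, since the YAML file lists no compilers for it.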

Second, you must run additional build scripts after creating and activating the new conda environment with the appropriate compilers installed. Below, we provide code to carry out the whole process (after modifying the environment YAML file).

# Create the environment and activate it (after first editing the environment YAML file).
mamba env create -n aa_stats_mv_dists --file envs/aa_stats_mv_dists.yml
conda activate aa_stats_mv_dists

# Install the remaining dependencies within this conda environment:
bash install/install_pathd8.sh
bash install/install_treepl.sh
bash install/install_r_packages_for_aa_stats_mv_dists.sh

Data

Before proceeding with any (re)analysis, first download the NovelTree run outputs from Zenodo and decompress them:

# Download all data and results from Zenodo (note: this file is 13GB).
wget https://zenodo.org/records/14425432/files/2024-organismal-selection-zenodo.zip

# Extract these data:
unzip 2024-organismal-selection-zenodo.zip

# Navigate into the directory and extract the NovelTree run outputs for reanalysis:
cd 2024-organismal-selection-zenodo/
tar -xzvf results-noveltree-model-euks.tar.gz

The data hosted on Zenodo include a directory (2024-organismal-selection-zenodo/) containing the following:

run_configurations/euk_preprocess_samplesheet.tsv - the samplesheet for our Snakemake preprocessing workflow, which filters and preprocesses species proteomes prior to analysis with NovelTree.
run_configurations/noveltree-model-euks-samplesheet.csv & run_configurations/noveltree-model-euks-parameterfile.json - the NovelTree samplesheet and parameter file used to run NovelTree.
preprocessed_proteomes.tar.gz - a compressed tarball containing the preprocessed proteomes used by our NovelTree run.
results-noveltree-model-euks.tar.gz - a compressed tarball containing all outputs generated by our NovelTree run.
aa-summary-stats.tar.gz - a compressed tarball containing all AA summary statistics generated by code/genefam_aa_summaries.py.
gf-aa-multivar-distances.tar.gz - a compressed tarball containing all result files produced by code/calc_protein_mv_distances.R.
organismal_selection_tool_citations.csv - source citations describing available genetic perturbations for organisms in our portfolio.
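
Because the archive is large, it can be worth confirming that the extraction produced every file listed above before starting an analysis. The following is a minimal sketch of such a check, not a script shipped with this repository; the function name `missing_files` is hypothetical, and the relative paths are copied from the list above.

```python
from pathlib import Path

# Relative paths copied from the file list above.
EXPECTED_FILES = [
    "run_configurations/noveltree-model-euks-samplesheet.csv",
    "run_configurations/euk_preprocess_samplesheet.tsv",
    "run_configurations/noveltree-model-euks-parameterfile.json",
    "preprocessed_proteomes.tar.gz",
    "results-noveltree-model-euks.tar.gz",
    "aa-summary-stats.tar.gz",
    "gf-aa-multivar-distances.tar.gz",
    "organismal_selection_tool_citations.csv",
]


def missing_files(base_dir):
    """Return the expected relative paths that are absent under base_dir."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]
```

For example, `missing_files("2024-organismal-selection-zenodo")` returns an empty list when every listed file is present, and otherwise names the files that still need to be downloaded or extracted.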

Usage

With the NovelTree run outputs downloaded and extracted into the base directory of this repository, we now proceed by calling the script code/genefam_aa_summaries.py. For each protein sequence within each gene family, this Python script calculates summaries of AA composition and of AA physical properties. The commands in this section assume that you have downloaded and extracted the directory 2024-organismal-selection-zenodo/ from this pub's corresponding Zenodo repository.
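
To make the idea of a per-protein summary concrete, the sketch below computes AA composition and one physical property (mean Kyte-Doolittle hydropathy) for a single alignment row using only the Python standard library. It is an illustration under stated assumptions, not the implementation in code/genefam_aa_summaries.py, which reports a broader set of properties; the function name `aa_summary` and the returned field names are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aa_summary(aligned_seq):
    """Summarize one protein taken from a multiple sequence alignment.

    Hypothetical helper: returns the ungapped length, the fraction of each
    amino acid present, and the mean hydropathy of the sequence.
    """
    # Alignment rows contain gap characters; keep only standard residues.
    residues = [aa for aa in aligned_seq.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    counts = Counter(residues)
    fractions = {aa: counts[aa] / n for aa in sorted(counts)}
    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    return {"length": n, "fractions": fractions, "mean_hydropathy": mean_hydropathy}
```

Applying a function like this to every row of every alignment, then aggregating within each gene family, gives the two levels of summary described above.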

# Ensure we are calling this script within the correct conda environment
conda activate aa_stats_mv_dists

# Store the MSA directory in a variable
msa_dir="2024-organismal-selection-zenodo/results-noveltree-model-euks/witch_alignments/original_alignments/"

# Now, run the script to calculate the physicochemical properties of each protein using ProtParam
python code/genefam_aa_summaries.py -t 10 "$msa_dir"

This will create a new directory called "aa-summary-stats/" that contains the calculated AA properties for each protein, along with summaries for each gene family. With these protein properties curated, we can now calculate the pairwise multivariate distances between proteins within each gene family:

Rscript code/calc_protein_mv_distances.R

Briefly, this script:

Reads in the species tree from the NovelTree run results and time-calibrates it using a species tree containing these species obtained from timetree.org.
Reads in species metadata from the NovelTree samplesheet and copy number information.
Reads in the gene family trees and the protein properties calculated by code/genefam_aa_summaries.py, retaining only those gene families that contain human proteins, and then, for each gene family:
- Time-calibrates the gene family tree so that branch lengths reflect time rather than the extent of sequence divergence.
- Uses this tree to transform the AA physical properties, correcting for phylogenetic non-independence between proteins.
- Calculates multivariate (Mahalanobis) distances between proteins.
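
The last step can be illustrated with a small, self-contained sketch. The Mahalanobis distance between two trait vectors a and b is sqrt((a - b)^T S^-1 (a - b)), where S is the covariance matrix of the traits. The Python below writes this out for the two-trait case using only the standard library. It is a simplified illustration, not the method in code/calc_protein_mv_distances.R: it omits the phylogenetic transformation entirely, handles only two traits, and the function names are hypothetical.

```python
import math


def covariance_2d(points):
    """Sample covariance terms (sxx, sxy, syy) for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate covariance")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(a, b, points):
    """Mahalanobis distance between a and b under the covariance of points."""
    sxx, sxy, syy = covariance_2d(points)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Squared distance: delta^T S^-1 delta, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)
```

Unlike a plain Euclidean distance, this measure rescales each trait by its variance and accounts for correlation between traits, so properties measured on different scales contribute comparably.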

Replicating the analyses of molecular conservation in the pub

Create the conda environment and install the remaining R packages:

mamba env create -n organismal-selection-analysis --file envs/analysis.yml

conda activate organismal-selection-analysis

Rscript install/install_r_packages_for_analysis.R

Next, load and organize the data:

Rscript code/org-sel-data.R

The code to recreate the analyses and figures from the pub is in the script code/org-sel-analysis.R.

Contributing

See how we recognize feedback and contributions to our code.
