Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2

blab/cartography

Sravani Nanduri¹, Allison Black², Trevor Bedford²,³, John Huddleston²,⁴

  1. Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  2. Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
  3. Howard Hughes Medical Institute, Seattle, WA, USA
  4. Corresponding author (jhuddles@fredhutch.org)

Preprint: https://doi.org/10.1101/2024.02.07.579374

Abstract

Public health researchers and practitioners commonly infer phylogenies from viral genome sequences to understand transmission dynamics and identify clusters of genetically related samples. However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods. Even when phylogenies are appropriate, they can be unnecessary or difficult to interpret without specialty knowledge. For example, pairwise distances between sequences can be enough to identify clusters of related samples or assign new samples to existing phylogenetic clusters. In this work, we tested whether dimensionality reduction methods could capture known genetic groups within two human pathogenic viruses that cause substantial human morbidity and mortality and frequently reassort or recombine, respectively: seasonal influenza A/H3N2 and SARS-CoV-2. We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2). For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding. We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages. We found that MDS embeddings accurately represented pairwise genetic distances including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages. Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages. We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses. Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.
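The abstract describes comparing pairwise genetic distances with Euclidean distances in an embedding. The following is a minimal sketch of that idea, not the project's implementation. It assumes Hamming distance between aligned sequences as the genetic distance and Pearson's coefficient as the correlation; the function names, the toy sequences, and the two-dimensional coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def hamming_distance(a: str, b: str) -> int:
    """Count the positions where two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def distance_correlation(sequences, embedding):
    """Correlate genetic and Euclidean distances over all sample pairs."""
    genetic, euclidean = [], []
    for a, b in combinations(sorted(sequences), 2):
        genetic.append(hamming_distance(sequences[a], sequences[b]))
        euclidean.append(dist(embedding[a], embedding[b]))
    return pearson(genetic, euclidean)


# Hypothetical toy alignment: two pairs of closely related sequences.
sequences = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TCGAACCA",
    "s4": "TCGAACCT",
}

# Hypothetical 2D coordinates standing in for a PCA, MDS, t-SNE, or UMAP embedding.
embedding = {
    "s1": (0.0, 0.0),
    "s2": (1.0, 0.0),
    "s3": (4.0, 0.5),
    "s4": (3.5, 1.0),
}

r = distance_correlation(sequences, embedding)
print(f"Pearson r between genetic and Euclidean distances: {r:.3f}")
```

A coefficient near 1 suggests that the embedding preserves the relative genetic distances between samples, which is the property the abstract reports for MDS.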

Phylogenetic trees and embeddings

Explore the phylogenetic trees and embeddings on Nextstrain.

Interactive figures

Main figures

Supplemental figures

Supplemental tables

Full analysis

Installation

First, install Conda with the Miniconda distribution. Until Bioconda supports modern Mac CPUs, Mac users with M1/M2 CPUs (the ARM64 architecture) need to install the Mac Intel x86 Miniconda distribution and install Rosetta, so the workflow can run under Mac's emulation mode.
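To check whether the note about M1/M2 Macs applies to your machine, a small script along these lines may help. This is a sketch under the assumption that a natively running Python reports `Darwin` and `arm64` on these Macs; a Python interpreter that already runs under Rosetta typically reports `x86_64` instead. The helper name `needs_intel_miniconda` is hypothetical.

```python
import platform


def needs_intel_miniconda(system: str, machine: str) -> bool:
    """Return True for macOS on an ARM64 CPU (for example, M1 or M2)."""
    return system == "Darwin" and machine.lower() in {"arm64", "aarch64"}


system = platform.system()
machine = platform.machine()

if needs_intel_miniconda(system, machine):
    print("ARM64 Mac detected: install the Intel x86 Miniconda distribution and Rosetta.")
else:
    print(f"Detected {system} on {machine}: the standard Miniconda installer should apply.")
```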

After installing Conda, create the environment for this project.

conda env create -f cartography.yml

Activate the environment prior to running the workflow below.

conda activate cartography

Next, install Julia and then install TreeKnit, following its instructions for the "CLI" version. By default, the TreeKnit binary installs in your home directory at ~/.julia/bin/treeknit. The project's workflow calls this path to run TreeKnit.
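Before starting the workflow, you may want to confirm that the TreeKnit binary exists at the default path. The sketch below only checks for an executable file at ~/.julia/bin/treeknit; the function names are hypothetical, and the script does not run TreeKnit or verify its version.

```python
import os
from pathlib import Path
from typing import Optional


def treeknit_path(home: Optional[Path] = None) -> Path:
    """Build the default TreeKnit path under the given home directory."""
    base = home if home is not None else Path.home()
    return base / ".julia" / "bin" / "treeknit"


def treeknit_is_installed(home: Optional[Path] = None) -> bool:
    """Return True if an executable file exists at the default TreeKnit path."""
    path = treeknit_path(home)
    return path.is_file() and os.access(path, os.X_OK)


if treeknit_is_installed():
    print(f"Found TreeKnit at {treeknit_path()}")
else:
    print(f"TreeKnit not found at {treeknit_path()}; install the CLI version first.")
```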

Notes for Windows users

If you are a Windows user, you will need to install WSL to run this project's workflow. You cannot put this GitHub repository in the Users folder. Snakemake treats the \U in a path such as C:\Users as the start of a Unicode escape, raises a unicodeescape error, and will not run. Instead, make a folder outside of the Users folder (e.g., directly in the C drive) and install this GitHub repository, Anaconda, and all other dependencies there.
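The unicodeescape error most likely comes from how Python parses a backslash followed by `U` inside a string literal, as in a path under C:\Users. The sketch below reproduces that behavior in plain Python. It assumes this is the mechanism behind the Snakemake failure, which the source does not confirm, and both example paths are hypothetical.

```python
import warnings


def triggers_unicode_escape(path_text: str) -> bool:
    """Return True if the path fails to parse as a plain Python string literal
    because of a malformed Unicode escape such as the \\U in C:\\Users."""
    source = '"' + path_text + '"'
    try:
        with warnings.catch_warnings():
            # Other backslash sequences may only produce warnings; ignore them here.
            warnings.simplefilter("ignore")
            compile(source, "<path-check>", "eval")
    except SyntaxError as error:
        return "unicodeescape" in str(error)
    return False


# Hypothetical repository locations on a Windows drive.
inside_users = r"C:\Users\me\cartography"
outside_users = r"C:\cartography"

for path in (inside_users, outside_users):
    status = "unicodeescape error" if triggers_unicode_escape(path) else "parses without that error"
    print(f"{path}: {status}")
```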

Run the full analysis

Run the full analysis for the project, which includes simulations, analysis of natural populations, and generation of the manuscript with its figures and tables. Use the following command to run the analysis on a single compute node (e.g., a local laptop or a single cluster node through an interactive shell).

snakemake --profile profiles/local

Use the following command to run the analysis on a SLURM cluster, submitting no more than 20 jobs at a time.

snakemake -j 20 --profile profiles/slurm

This is a complex workflow, so it will take several hours to run.
