Cassia Wagner 1,2,*, Kathryn E. Kistler 2,3, Garrett A. Perchetti 4, Noah Baker 4, Lauren A. Frisbie 5, Laura Marcela Torres 5, Frank Aragona 5, Cory Yun 5, Marlin Figgins 2,6, Alexander L. Greninger 2,4, Alex Cox 5, Hanna N. Oltean 5, Pavitra Roychoudhury 2,4, Trevor Bedford 1,2,3
1 Department of Genome Sciences, University of Washington, Seattle, WA, USA;
2 Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, WA, USA;
3 Howard Hughes Medical Institute, Seattle, WA, USA;
4 Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA;
5 Washington State Department of Health, Shoreline, Washington, USA;
6 Department of Applied Mathematics, University of Washington, Seattle, Washington, USA.
* Corresponding author: cassiasw@uw.edu
Knockout of the ORF8 protein has repeatedly spread through the global viral population during SARS-CoV-2 evolution. Here we use both regional and global pathogen sequencing to explore the selection pressures underlying its loss. In Washington State, we identified transmission clusters with ORF8 knockout throughout SARS-CoV-2 evolution, not just on novel, high fitness viral backbones. Indeed, ORF8 is truncated more frequently and knockouts circulate for longer than for any other gene. Using a global phylogeny, we find evidence of positive selection to explain this phenomenon: nonsense mutations resulting in shortened protein products occur more frequently and are associated with faster clade growth rates than synonymous mutations in ORF8. Loss of ORF8 is also associated with reduced clinical severity, highlighting the diverse clinical impacts of SARS-CoV-2 evolution.
This repository includes the code for the analyses and figures for the above manuscript.
Clinical data from Washington State Disease Reporting System is not included as this data is derived from confidential medical records.
GISAID metadata and sequences used in the analysis may be accessed at gisaid.org/EPI_SET_230921by.
The SARS-CoV-2 UShER phylogeny is available from UShER.
- code contains the scripts for all analyses.
- data contains simulated clinical data containing all variables used in severity analysis to check code quality. It also contains a subset of the clinical data variables, which we have permission to share. This folder also contains mutation annotations from Obermeyer et al. Please access the GISAID sequences and SARS-CoV-2 UShER phylogeny using the above links.
- nextstrain_build contains the identified clusters and the configurations for building the nextstrain trees to identify transmission clusters of gene knockouts in Washington State.
- envs contains the conda config files for python code & notebooks and for matUtils.
- notebooks contains jupyter notebooks for plotting results and initial analyses.
- params includes the SARS-CoV-2 reference genomes used in analyses & the config file for snakemake pipeline.
- usher contains results from analyses using the usher phylogeny.
- intrahost contains intrahost variants after filtering to remove samples that did not pass QC.
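As a quick sanity check on the shared data, the demo clinical table can be previewed without assuming anything about its column names. The following is a minimal sketch using only the Python standard library. The helper name `preview_tsv` is hypothetical and not part of this repository; the path `data/clinical_example.tsv` is the demo dataset named in this README, and the sketch assumes it is run from the repository root.

```python
import csv
from pathlib import Path


def preview_tsv(path, n_rows=5):
    """Return (column_names, first n_rows data rows) from a tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Demo dataset named in this README; skipped silently if the file is not present.
demo = Path("data/clinical_example.tsv")
if demo.exists():
    columns, rows = preview_tsv(demo)
    print(f"{len(columns)} columns: {columns}")
    print(f"previewed {len(rows)} rows")
```

The same helper can be pointed at `data/clinical_subset.tsv` to compare which variables are present in the shareable subset.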
Use mamba to install the matUtils and Python notebook environments (about 5 minutes). The environment for Python scripts & notebooks can be set up & activated using:
# Install
mamba env create -f envs/orf8ko.yaml
# Activate
mamba activate orf8ko
The environment for matUtils can be set up & activated using:
# Install
mamba env create -f envs/usher-env.yaml
# Activate
mamba activate usher-env
R scripts were run in RStudio using R version 4.1.2. The R environment dependencies are listed in envs/renv.lock. To use this environment:
# Install renv
install.packages("renv")
# Restore the R package environment from the lockfile
renv::restore(lockfile = 'envs/renv.lock')
This process should take a few minutes.
- Run code/find_ko.py on a .fasta alignment of WA sequences to call potential gene knockouts. See above to access sequences and metadata from GISAID.
- Build and call transmission clusters using nextstrain_build.
- Run intrahost analysis using notebooks/intrahost_analysis.ipynb.
- Calculate dN/dS using the snakemake workflow: code/dNdS_snakefile. See above to download the UShER tree for this analysis.
- Call mutation clusters from the UShER tree using code/getMutationClusters.py.
- Model cluster growth rates using code/clusterSize_regression.R.
- Run clade-level analyses using the snakemake workflow: code/variant_snakefile. See above to download the UShER tree for this analysis.
- code/combineClinicalData.R is used to generate the dataframe for clinical analysis.
- Use code/Fig5.R to run the clinical severity analysis. Although we cannot share the full clinical data to protect patient privacy, we have provided data/clinical_example.tsv as a demo dataset. We have also provided a subset of clinical variables, which we are able to share while protecting patient privacy, at data/clinical_subset.tsv.
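To make the terminology in the workflow concrete, the sketch below shows one way to classify a single-base change as synonymous, missense, or nonsense, and to find a premature stop codon in a coding sequence. It is an illustration only and is not the logic used in code/find_ko.py or the dN/dS workflow; the function names `classify_substitution` and `first_premature_stop` are hypothetical, and the toy sequences are made up rather than taken from ORF8.

```python
from itertools import product
from typing import Optional

# Standard genetic code, built from the conventional TCAG ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change at a 0-based position within one codon.

    Simplification in this sketch: a change that removes a stop codon
    is reported as "missense".
    """
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


def first_premature_stop(cds: str) -> Optional[int]:
    """Return the 1-based index of the first stop codon before the final codon.

    Returns None when no premature stop is found. Codons containing
    ambiguous bases or gaps are not in the table and are ignored.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    for i in range(n_codons - 1):  # the final codon is the expected stop
        if CODON_TABLE.get(cds[3 * i:3 * i + 3]) == "*":
            return i + 1
    return None


# Toy examples (made-up sequences, not ORF8).
print(classify_substitution("CAA", 0, "T"))     # CAA (Q) -> TAA (stop): nonsense
print(classify_substitution("CTT", 2, "C"))     # CTT (L) -> CTC (L): synonymous
print(first_premature_stop("ATGAAATAAGGGTAA"))  # stop at codon 3 of 5
```

A real knockout caller would also need to handle frameshifting insertions and deletions, which this sketch does not attempt.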