
Multilayer modelling of the human transcriptome and biological mechanisms of complex diseases and traits

Tiago Azevedo, Giovanna Maria Dimitri, Pietro Lió, Eric R. Gamazon


This repository contains all the code necessary to run and further extend the experiments presented in the following paper accepted at npj Systems Biology and Applications: https://doi.org/10.1038/s41540-021-00186-6.

Abstract

Here, we performed a comprehensive intra-tissue and inter-tissue multilayer network analysis of the human transcriptome. We generated an atlas of communities in gene co-expression networks in 49 tissues (GTEx v8), evaluated their tissue specificity, and investigated their methodological implications. UMAP embeddings of gene expression from the communities (representing nearly 18% of all genes) robustly identified biologically-meaningful clusters. Notably, new gene expression data can be embedded into our algorithmically derived models to accelerate discoveries in high-dimensional molecular datasets and downstream diagnostic or prognostic applications. We demonstrate the generalisability of our approach through systematic testing in external genomic and transcriptomic datasets. Methodologically, prioritisation of the communities in a transcriptome-wide association study of the biomarker C-reactive protein (CRP) in 361,194 individuals in the UK Biobank identified genetically-determined expression changes associated with CRP and led to considerably improved performance. Furthermore, a deep learning framework applied to the communities in nearly 11,000 tumors profiled by The Cancer Genome Atlas across 33 different cancer types learned biologically-meaningful latent spaces, representing metastasis () and stemness (). Our study provides a rich genomic resource to catalyse research into inter-tissue regulatory mechanisms, and their downstream consequences on human disease.

Repository Structure

This repository contains all the scripts which were used in the paper. The number in each script's name (in the root of this repository) corresponds to the order in which they are run in the paper.

Each folder used in this repository is explained as follows:

  • meta_data: This folder includes files complementary to GTEx that are necessary to run some experiments. Examples include phenotype information, gene-name conversion tables, and files identifying Reactome pathways.

  • outputs: This folder contains the outputs of some of the numbered scripts. Some of them contain important information used in the paper, for example output_02_01.txt, output_04_02.txt, and output_06_02.txt.

  • results: Selected result files, such as the communities identified by the Louvain algorithm.

  • svm_results: Files with the metrics resulting from the SVM predictions.

  • track_hub: Files in the format required for Track Hub.

The repository also contains some Jupyter notebooks, which we hope will help researchers use our results in their own experiments and improve the reproducibility of this paper:

  • 09_community_info.ipynb: Instructions on how to inspect the characterisation of each community, including generation of LaTeX code.

  • 10_reactomes_per_tissue.ipynb: Instructions on how to check information regarding which reactomes were able to predict each tissue.

  • 11_multiplex_enrichment.ipynb: Instructions on how to check the group of genes identified in each multiplex network.

  • 12_tcga.ipynb: The code used in the paper to analyse the TCGA dataset within the GTEx pipeline, together with an accompanying R script (12_01_correct_confounds_tcga.R) used to correct the data.

  • 13_plots_for_paper.ipynb: The code used to generate the plots from the paper.

  • 14_track_hub.ipynb: Code and explanations of how we generated the files needed for Track Hub.

Installing Dependencies

These scripts were tested on Ubuntu 16.04 (Linux), with environments created using Anaconda.

Python scripts

We include a working dependency file, python_environment.yml, describing the exact dependencies used to run the Python scripts. To install all of them automatically, run the following commands in a terminal to create and activate an Anaconda environment:

$ conda env create --force --file python_environment.yml
$ conda activate gtex-env

To summarise the python_environment.yml file, the main dependencies needed to run these scripts are:

  • gseapy 0.9.16
  • jupyterlab 1.1.4
  • matplotlib 3.1.0
  • networkx 2.4
  • numpy 1.17.3
  • pandas 1.0.1
  • python 3.7.5
  • scikit-learn 0.21.3
  • statsmodels 0.10.2
  • umap-learn 0.4.2
  • bctpy 0.5.0
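
As a quick sanity check after activating the environment, the pinned versions listed above can be compared with what is actually installed. The following is a minimal sketch and not part of the repository: the helper names installed_version and check_versions are hypothetical, and it assumes the distribution names match the names in the list (python and jupyterlab are left out for brevity).

```python
# Pins copied from the dependency list above.
PINNED = {
    "gseapy": "0.9.16",
    "matplotlib": "3.1.0",
    "networkx": "2.4",
    "numpy": "1.17.3",
    "pandas": "1.0.1",
    "scikit-learn": "0.21.3",
    "statsmodels": "0.10.2",
    "umap-learn": "0.4.2",
    "bctpy": "0.5.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        from importlib import metadata  # standard library from Python 3.8
    except ImportError:
        import pkg_resources  # ships with setuptools; covers Python 3.7
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(pinned, lookup=installed_version):
    """Return {name: (expected, found)} for every missing or mismatched package."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package matches its pin; anything printed points to a package worth reinstalling from python_environment.yml.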

R scripts

We used R to run the unsupervised correction package sva, which is briefly described in the paper. For convenience, we also include an Anaconda environment file with the R dependencies under which these scripts were run. As with Python, they can be installed using the following commands:

$ conda env create --force --file r_environment.yml
$ conda activate r_env

After the R environment is created and activated, install the sva package as described in the original repository:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sva")

We decided to keep the Python and R scripts in separate environments to avoid dependency conflicts between the two languages.

Data Requirements

For details on the data used, which cannot be publicly shared in this repository, please see the paper.

Running the scripts

To view and run our Jupyter notebooks, start JupyterLab from the root of this repository:

$ jupyter lab --port=8895

This command prints a link to the server running on the local machine at port 8895, which can be opened in a browser.

The Python scripts in this repository are numbered, suggesting the order in which they should be executed. In addition, each script begins with a short description of what it does. To run Python scripts 01 to 04_02, and 06_02, use the following command:

$ python -u PYTHON_FILE | tee outputs/output_file.txt

The previous command runs PYTHON_FILE and logs its output in outputs/output_file.txt. All the other Python scripts expect one or two flags to be passed. Each flag is documented in the corresponding parser.add_argument call in each file. For example, Python script 05_01 expects the flag --tissue_num, so that flag must be passed when executing the script:

$ python -u 05_01_svms_communities.py --tissue_num NUM | tee outputs/output_05_01_NUM.txt

where NUM corresponds to the value for that flag.
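
To illustrate how such a flag is typically declared and how the command above is assembled for a given value, here is a minimal sketch. It is not the code from 05_01_svms_communities.py: build_parser and command_for are hypothetical helpers, and treating --tissue_num as a required integer is an assumption; the real definition is in the script's own parser.add_argument call.

```python
import argparse
import shlex


def build_parser():
    """Hypothetical parser mirroring a script that expects --tissue_num."""
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for a numbered script"
    )
    parser.add_argument(
        "--tissue_num",
        type=int,        # assumption: the flag takes an integer
        required=True,   # assumption: the flag is mandatory
        help="Which tissue to process",
    )
    return parser


def command_for(tissue_num, script="05_01_svms_communities.py"):
    """Build the shell command shown above for one value of NUM."""
    log = f"outputs/output_05_01_{tissue_num}.txt"
    return (
        f"python -u {shlex.quote(script)} --tissue_num {tissue_num}"
        f" | tee {shlex.quote(log)}"
    )


if __name__ == "__main__":
    args = build_parser().parse_args(["--tissue_num", "3"])
    print(command_for(args.tissue_num))
```

Printing the command for several values of NUM gives a list that can be pasted into a terminal or a job scheduler, one line per tissue.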

The following command is an example of how to run an R script:

$ Rscript --no-save --no-restore --verbose 02_01_correct_confounds.R > outputs/output_02_01.txt 2>&1

The previous command will run the 02_01_correct_confounds.R script and log the output of the script in outputs/output_02_01.txt.
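
When launching scripts from Python rather than from a shell, the same redirection (standard output and standard error into one log file, as with `> file 2>&1`) can be reproduced with the standard library. This is a minimal sketch under that assumption; run_and_log is a hypothetical helper and is not part of this repository.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(cmd, log_path):
    """Run cmd, writing stdout and stderr to log_path; return the exit code."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log:
        # stderr=subprocess.STDOUT merges both streams, like `2>&1`.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    # Demonstration with a harmless command; an Rscript invocation would be
    # passed in the same way, as a list of arguments.
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "outputs" / "demo.txt"
        code = run_and_log([sys.executable, "-c", "print('hello')"], log_file)
        print(code, log_file.read_text().strip())
```

For the R example above, cmd would be the list ["Rscript", "--no-save", "--no-restore", "--verbose", "02_01_correct_confounds.R"] and log_path would be outputs/output_02_01.txt.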

Other scripts

The scripts for the multilayer modeling approach to TWAS/PrediXcan (CRP in UKB) and Variational Autoencoder model (TCGA) are in this external repository.

Citing this work

To cite our work, we provide the following BibTeX:

@article{Azevedo2021,
  doi = {10.1038/s41540-021-00186-6},
  url = {https://doi.org/10.1038/s41540-021-00186-6},
  year = {2021},
  month = may,
  publisher = {Springer Science and Business Media {LLC}},
  volume = {7},
  number = {1},
  pages={1--13},
  author = {Tiago Azevedo and Giovanna Maria Dimitri and Pietro Li{\'{o}} and Eric R. Gamazon},
  title = {Multilayer modelling of the human transcriptome and biological mechanisms of complex diseases and traits},
  journal = {npj Systems Biology and Applications}
}

Questions?

If you run into any problem or have a question, please just open an issue.