Authors: Ha Nguyen, Van-Dung Pham, Hung Nguyen, Bang Tran, Nicole Schrad, Juli Petereit, and Tin Nguyen
This cloud-based learning module teaches pathway analysis, a set of tools and techniques used in life sciences research to discover the biological mechanisms behind a condition from high-throughput biological data. Pathway analysis tools are primarily used to analyze omics datasets to detect relevant groups of genes that are altered in case samples compared to a control group. Pathway analysis approaches use existing pathway databases and gene expression data to identify pathways significantly impacted in a given condition.
This module will cost approximately $1.00 to run, assuming you shut down and delete all resources at the end of your analysis.
Watch this Introduction Video to learn more about the module.
The course is structured in five submodules, allowing us to:
- Download and process data from public repositories.
- Perform differential analysis.
- Perform pathway analysis using different methods to answer different research hypotheses.
- Perform meta-analysis and combine methods and datasets to find consensus results.
- Interactively explore significantly impacted pathways across multiple analyses and browse relationships between pathways and genes.
The course content is organized in R Jupyter Notebooks. Another way to view this module is as a Jupyter Book, a package that combines individual Jupyter Notebooks into a web interface for better navigation. This is only for viewing the notebooks, not running them. Details on installing the tools and formatting the content can be found at: https://jupyterbook.org/en/stable/intro.html. The course content is located in the GitHub repository of Dr. Tin Nguyen's lab: https://github.com/tinnlab/NOSI-Google-Cloud-Training.
The overall structure of the modules is explained below:
- Submodule 01 describes how to obtain data from public repositories, process and save the expression matrix, and map probe IDs to gene symbols.
- Submodule 02 focuses on differential expression analysis using
limma
,t-test
,edgeR
, andDESeq2
. - Submodule 03 introduces common curated biological databases such as Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and REACTOME.
- Submodule 04 aims at performing enrichment analysis using popular methods such as
ORA
,FGSEA
, andGSA
. - Submodule 05 aims at performing meta-analysis using multiple datasets.
This learning module requires some computational hardware and local environment setup. The browser-based development environment is provided by Google and AWS. However, users need a Google email/Amazon account, sufficient internet access, and a standard web browser (e.g., Chrome, Edge, Firefox; Chrome is recommended) to create a Cloud Virtual Machine/SageMaker Notebook for analysis. It is recommended to execute the Jupyter Notebook using an R kernel (version > 4.1) on a standard machine with a minimum configuration of 4 vCPUs, 15 GB RAM, and 10 GB of HDD.
The following are the R and tool versions used during development:
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GSA_1.03.2 fgsea_1.24.0 BiocManager_1.30.20
[4] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[7] dplyr_1.1.0 purrr_1.0.1 readr_2.1.4
[10] tidyr_1.3.0 tibble_3.2.0 ggplot2_3.4.1
[13] tidyverse_2.0.0 hgu133plus2.db_3.13.0 org.Hs.eg.db_3.16.0
[16] AnnotationDbi_1.60.2 IRanges_2.32.0 S4Vectors_0.36.2
[19] Biobase_2.58.0 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] httr_1.4.5 bit64_4.0.5 vroom_1.6.1
[4] jsonlite_1.8.4 blob_1.2.3 GenomeInfoDbData_1.2.9
[7] pillar_1.8.1 RSQLite_2.3.0 lattice_0.20-45
[10] glue_1.6.2 uuid_1.1-0 digest_0.6.31
[13] XVector_0.38.0 colorspace_2.1-0 cowplot_1.1.1
[16] htmltools_0.5.4 Matrix_1.5-1 pkgconfig_2.0.3
[19] zlibbioc_1.44.0 scales_1.2.1 tzdb_0.3.0
[22] BiocParallel_1.32.6 timechange_0.2.0 KEGGREST_1.38.0
[25] farver_2.1.1 generics_0.1.3 ellipsis_0.3.2
[28] cachem_1.0.7 withr_2.5.0 repr_1.1.6
[31] cli_3.6.0 magrittr_2.0.3 crayon_1.5.2
[34] memoise_2.0.1 evaluate_0.20 fansi_1.0.4
[37] tools_4.2.2 data.table_1.14.8 hms_1.1.2
[40] lifecycle_1.0.3 munsell_0.5.0 Biostrings_2.66.0
[43] compiler_4.2.2 GenomeInfoDb_1.34.9 rlang_1.0.6
[46] grid_4.2.2 RCurl_1.98-1.10 pbdZMQ_0.3-9
[49] IRkernel_1.3.2 labeling_0.4.2 bitops_1.0-7
[52] base64enc_0.1-3 gtable_0.3.1 codetools_0.2-18
[55] DBI_1.1.3 R6_2.5.1 fastmap_1.1.1
[58] bit_4.0.5 utf8_1.2.3 fastmatch_1.1-3
[61] stringi_1.7.12 parallel_4.2.2 IRdisplay_1.1
[64] Rcpp_1.0.10 vctrs_0.5.2 png_0.1-8
[67] tidyselect_1.2.0
All data used in the modules were downloaded from the Gene Expression Omnibus (GEO) repository using accession number GSE48350. The data were originally generated by Berchtold and Cotman (2013). We preprocessed and normalized this data before use in subsequent analyses.
Some common errors include:
- The Jupyter Notebook kernel defaulting to Python, preventing libraries from loading properly. To fix this, check the upper-right corner of the edit ribbon to ensure it says "R". If not, click the text next to the circle (O) to change the kernel.
- Problems loading a library. Ensure the package is properly installed.
gsutil
command issues. Try using thesystem
function in R to run bash commands:system("gsutil or bash command", intern= TRUE)
.- Errors creating a bucket, possibly due to an existing bucket with the same name. Choose a unique bucket name.
- Package installation issues. Follow the instructions in the documentation.
- Errors due to capitalization or spelling mistakes.
This work was fully supported by NIH NIGMS under grant number GM103440. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License