- Basic exploration: extraction, reformatting, sanity checks (notebook)
- Quality control & transformations: quantile-quantile plots
- Protein data (notebook)
- RNAseq data
- Unsupervised exploration (PCA, hierarchical clustering) and correlations
- Protein data (notebook)
- notes on PCA with prcomp and factoextra after the PCA discussion meeting
- RNAseq data (notebook)
- Protein data (notebook)
- Basic exploration: extraction, reformatting, sanity checks (notebook)
- Derived variables (e.g. age, survival) and correlations (e.g. CD4) (notebook)
- derived variables
- which clinical variables can be used as covariates, and which should be used (or can be considered) as outcomes
- descriptive statistics to summarize characteristics of the studied cohort
- simple correlations
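
The derived variables mentioned above (age, survival) could be computed along these lines. This is only a minimal sketch: the field names (`birth`, `admission`, `death`), the toy dates and the `study_end` censoring date are assumptions made for illustration and do not come from the actual clinical table.

```python
from datetime import date

def age_in_years(birth, reference):
    """Completed years between birth and a reference date (e.g. admission)."""
    years = reference.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years

def survival_days(admission, death, study_end):
    """Return (days, event): event is 1 if death was observed, 0 if censored."""
    if death is not None:
        return (death - admission).days, 1
    return (study_end - admission).days, 0

# Hypothetical records; the real data will have different fields and values.
patients = [
    {"id": "P1", "birth": date(1980, 6, 15), "admission": date(2015, 6, 14), "death": date(2015, 7, 4)},
    {"id": "P2", "birth": date(1990, 1, 1), "admission": date(2015, 3, 1), "death": None},
]
study_end = date(2015, 12, 31)  # assumed administrative censoring date

derived = {}
for p in patients:
    days, event = survival_days(p["admission"], p["death"], study_end)
    derived[p["id"]] = {
        "age": age_in_years(p["birth"], p["admission"]),
        "survival_days": days,
        "event": event,
    }
```

Keeping the event indicator next to the survival time means the same derived table can be used both as a covariate source and as an outcome in a later survival analysis.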
These analyses are proposed to be carried out in parallel to the PLS study. I aimed to choose techniques that are established as standard in the analysis of expression data and/or easy to apply (so that the main focus remains on the PLS).
- Differential expression (DE) (1 day)
- Survival analysis:
- Gene set enrichment analysis (GSEA)
- Multiple regressions/ANOVA using top DEGs
- Summary of findings:
- protein data
- RNA-seq
*) initial DESeq2 analysis was performed by Dr Rachel; I re-analysed the data with the limma-voom toolset and different normalization procedures
- Joint NMF clustering: a simple approach for unsupervised clustering
- Joint pathways analysis (late integration): using combined GSEA results (ActivePathways + Cytoscape)
- Correlations analysis (notebook)
- Review of available PLS implementations and recent developments
- Layman summary of PLS method with a graphical representation
- Reading up on scaling, de-noising, transformations and normalization
- Application to the analyzed omics data
Comparison of patient groups (by condition) (notebook)
- single-omic (O)PLS-DA discrimination analysis (X = [protein | RNA], Y=patient group)
- multi-omics (O)PLS-DA discrimination analysis (X = [protein, RNA], Y=patient group)
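
To make the difference between the single-omic and multi-omics set-ups concrete, here is a minimal sketch of a one-component PLS-DA on toy numbers. It is plain PLS with a 0/1 group label, not O-PLS (no orthogonal components are removed), it applies centring but no scaling, and all values and function names are hypothetical. A real analysis would use an established implementation with cross-validation.

```python
def column_means(X):
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]

def pls_da_one_component(X, y):
    """Fit a one-component PLS model to a 0/1 group label (PLS-DA)."""
    x_means = column_means(X)
    y_mean = sum(y) / len(y)
    Xc = [[v - m for v, m in zip(row, x_means)] for row in X]
    yc = [v - y_mean for v in y]
    # Weights: direction of maximal covariance between the features and the label.
    w = [sum(Xc[i][j] * yc[i] for i in range(len(Xc))) for j in range(len(x_means))]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each (centred) sample onto the weight vector.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    # Regress the centred label on the scores.
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return {"x_means": x_means, "y_mean": y_mean, "w": w, "q": q}

def predict_group(model, X, threshold=0.5):
    """Assign group 1 when the predicted label exceeds the threshold."""
    predictions = []
    for row in X:
        t = sum((v - m) * w for v, m, w in zip(row, model["x_means"], model["w"]))
        predictions.append(1 if model["y_mean"] + model["q"] * t > threshold else 0)
    return predictions

# Toy data: 4 patients, 2 protein features, 1 RNA feature (all values invented).
protein = [[1.0, 0.2], [1.2, 0.1], [3.0, 0.3], [3.1, 0.2]]
rna = [[5.0], [5.5], [2.0], [1.8]]
group = [0, 0, 1, 1]

# Single-omic model: X = protein only.
single = pls_da_one_component(protein, group)
# Multi-omics model: X = [protein, RNA], blocks concatenated column-wise.
joint_X = [p + r for p, r in zip(protein, rna)]
joint = pls_da_one_component(joint_X, group)

single_pred = predict_group(single, protein)
joint_pred = predict_group(joint, joint_X)
```

Because the blocks are simply concatenated, the block with more features or larger variance dominates the weights unless each block is scaled first, which is one reason the scaling and normalization reading listed above matters.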
Due to missing data, we may need to perform several analyses:
- comprehensive "high-level" analysis with the clinical variables available for all patients (e.g. survival, age, CSF color, CSF cloudiness, HIV status),
- subgroup-specific analyses, including the variables which were measured only for these patients:
- HIV positive patients, additional variables: CD4, ARV, UnARV
- patients with a history of TB, additional variables: PrevTBForm, PrevTBTreat, OnTBTreat (potentially also DateTBTreat and DateTBTreatStop; these are not useful on their own as they are mutually exclusive, but they might be of some use when combined with other dates)
- O-PLS: single-omic regression (X=[protein | RNA], Y=clinical outcomes)
- O-PLS: multi-omics regression (X=[protein, RNA], Y=clinical outcomes)
- O2-PLS (X=protein, Y=RNA); O2 because the relation in either direction is equally interesting. Bylesjö et al. (2007) give an example of such an analysis (supervised by J. Trygg).
- Compare to the joint NMF results
- Try applying alternative methods, two or more of:
- PARADIGM, iCluster - promising approaches
- CCA - similar to PLS
- COCA - very basic approach Beside comparing the
- Since this is so far essentially a gene expression study, we could look at the R svapls package
The original proposed research plan from the project description as a reference:
- Explore and QC the multi‐omics and clinical data using univariate and multivariate statistical tools. Initially each data set will be explored alone. Data will be checked for outliers and expected characteristics. Both univariate (e.g. linear models) and multivariate (e.g. PCA, PLS) approaches will be used.
- Integrate the multi‐omics data using multivariate techniques such as Orthogonal Partial Least Squares regression. Both conventional and O2PLS will be explored to integrate the omics data sets. Further QC analysis on the joint data will be conducted. Common sources of variance will be identified.
- The integrated multi‐omics data will be combined with the clinical parameters in a second stage data fusion, again using PLS and O2PLS techniques. We will investigate the ability of the individual and integrated data sets to predict clinical outcomes. Predictive models will be interrogated to elucidate proteins & transcripts linked to the disease diagnosis.
- If time allows we will investigate the use of OnPLS, a multiblock extension of O2PLS which allows to integrate data from more than two blocks. Results from this work will form the basis of future large‐scale clinical studies to further advance the diagnosis of infectious meningitides and understanding its pathogenesis. The ultimate aim is to derive signatures that can be validated in other trials and applicable for future clinical use.