-
Notifications
You must be signed in to change notification settings - Fork 0
/
systemsSerology_Advanced.Rmd
393 lines (276 loc) · 18.4 KB
/
systemsSerology_Advanced.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
---
title: "Systems Serology Machine Learning Workshop Advanced"
output:
html_document:
df_print: paged
---
## Getting started
Install the following packages before getting started (before the workshop):
* readxl: data import
* ggpubr: statistical test
* corrr: correlations
* ropls: multivariate analysis
* glmnet: feature selection
* DMwR: data imputation
* pheatmap: visualization
* ggplot2: visualization
* ggrepel: visualization
* RColorBrewer: color schemes for visualization
* igraph: network analysis
* ggraph: network visualization
* tidyverse: unnecessarily fancy variable handling
## Data import and inspection
The data is taken from the following publication: Lu et al., IFN-γ-independent immune markers of Mycobacterium tuberculosis exposure, Nature Medicine (2019).
1) First, we import the data using the function 'read_excel'.
Note: To learn more about individual functions and their usage, use the command ?function_name, e.g., ?read_excel.
2) To visualize the data we plot a heatmap using 'pheatmap'.
3) We need to define X (the measurements) and y (the output we want to predict) for the further analysis.
The column names (colnames) and row names (rownames) of X should be the features and sample IDs, respectively.
4) There are missing values in the dataset. We impute missing values with k nearest neighbors imputation using the package 'DMwR'. It is very important to evaluate how the imputation affects the model, however, this is beyond scope of this workshop.
5) The measurements are on different scales, to ensure comparison across different features we will standardize the data (Z-score).
```{r data import, fig.width = 20}
library(readxl)
library(pheatmap)
library(DMwR)
# Import the data from the excel sheet
data = read_excel("RSTR_data_NatureMedicine.xlsx")
head(data)
# Define X and y
X <- as.matrix(data[, 5:dim(data)[2]]) #Column 1:4 contains demographics and outcome
rownames(X) <- paste0("s", seq(nrow(X))) #generates sample ids as 's1', 's2', 's3', ...
y <- factor(data$dx)
names(y) <- rownames(X)
# We make a heatmap of the raw data
pheatmap(X)
# Data imputation for the missing vlaues.
X <- knnImputation(X)
y <- y[match(rownames(X), names(y))] #ensures that the order of samples in Y is same as in X
# Z-score the data to bring the measurements to the same scale (mean 0 and standard deviation 1 across samples)
X_zscore <- scale(X, center = TRUE, scale = TRUE)
# Display heatmap of the z-scored data indicating phenotypic outcome
annotation_row = data.frame(phenotype = y)
rownames(annotation_row) <- rownames(X)
pheatmap(X_zscore, annotation_row = annotation_row)
```
## Description of the data
Response variable:
The phenotype we want to predict is latent Mtb infection (LTB) or resistors (RSTR). The classification is based on the diagnosis.
Specifically, LTB and RSTR are defined in paper as following:
* LTB: A population is diagnosed based on evidence of immune sensitization to Mycobacterium tuberculosis (Mtb) antigens and the absence of clinical symptoms of tuberculosis (TB) or direct microbiologic evidence of disease.
* RSTR: a population of house hold contacts which were highly exposed to Mtb yet remained persistently negative in interferon-r (IFN-r) release assay (IGRA) and tuberculin skin tests (TST) over an average of 9.5 years of follow-up for each individual.
Features:
Different assays/tests are performed together with different antigens. The nomenclature is as follows:
[antigen abbreviation].[test abbreviation].[metric]
List of antigen abbreviations:
ppd - purified protein derivative, proteins extracted from Mtb used in diagnosis
ag85 - antigen 85, Mtb protein
esat6 and cfp10 - Mtb protein
lam - lipoarabinomannan from Mtb cell-wall
flu - Influenza virus peptides (respiratory pathogen)
pneumo - S. pneumoniae peptides (respiratory pathogen)
cmv - cytomegalo virus (Epstein Barr) (non-respiratory pathogen)
vzv - varicellazostervirus (non-respiratory pathogen)
rubella - rubella virus (non-respiratory pathogen)
tetanus - tetanus virus (non-respiratory pathogen)
groes - chaperone protein GroES
hspx - α-crystalline
cl - cardiolipin
ps - phosphatidylserine
b2gp1 - recombinant human apolipoprotein H/b2GP1
gp140 - HIV glycoprotein gp140
rsv - Respiratory syncytial virus
List of test abbreviations:
CD107a - Functional test of NK cell degranulation (CD107a)
IFNg - Functional test of NK cell degranulation (IFNg)
MIP1b - Functional test of NK cell degranulation (MIP1b)
ADNP - Functional test of Ab-dependent neutrophil phagocytosis
ADCP - Functional test of Ab-dependent cellular phagocytosis
IgG.elisa - IgG ELISA
IgG, IgM, IgG1, IgG2, IgG3, IgG4, IgA1, IgA2 - luminex FC subclasses and isotypes
IgG1_2 - ratio IgG1:IgG2
IgG1_3 - ratio IgG1:IgG3
IgG1_4 - ratio IgG1:IgG4
IgG2_3 - ratio IgG2:IgG3
IgG3_4 - ratio IgG3:IgG4
FcR.3AF - luminex FcR allele variant (3AF)
FcR.3AV - luminex FcR allele variant (3AV)
FcR.2AH - luminex FcR allele variant (2AH)
FcR.2AR - luminex FcR allele variant (3AR)
G0, G0*, G0B, G0F, G0FB,... - IgG glycans
List of metric abbreviations:
AUC - area under the curve calculated from median fluorescence intensities (MFIs) from three dilutions
5, 10, 100, 1000 - dilutions
Meta data:
The dataset also includes metadata about the subjects including age, sex, and BMI that we do not use in the analysis.
## Univariate analysis
In the following, we will visualize the IgG1 values for the individual classes across antigens. We annotate the significance using the Wilcox test with the library 'ggpubr'. Note: here, the p-values are not corrected for multiple hypothesis testing!
```{r univariate}
library(ggplot2)
library(ggpubr)
# Extract indices for features including the IgG AUCs
indFeatures <- which(grepl("IgG1.1000", colnames(X_zscore)))
# Reformat the data for the following plotting procedure
dfBox <- reshape2::melt(cbind(as.data.frame(X_zscore[,indFeatures]), class = y), id.var = "class")
colnames(dfBox) <- c("class", "features", "value")
ggplot(dfBox, aes(x = features, y = value, fill = class)) +
stat_boxplot(position = position_dodge(width = 0.75),
geom = 'errorbar', width = 0.2) + # boxplot
geom_boxplot(outlier.shape = NA) + # outliers are plotted below
geom_point(position = position_jitterdodge(), shape = 21, size = 1,
aes(fill = class), color = "black") + # add points for individual measurements
ylab("z-scored value") +
theme_classic() + # visualization style
theme(axis.text.x = element_text(angle = 40, hjust = 1)) + # rotate x-labels
stat_compare_means(aes(group = class), label = "p.signif", method = "wilcox.test") # comparison of groups using Wilcox test
```
## Multivariate analysis
In this part of the workshop, we are going to analyze the data in a multivariate fashion to find multivariate signatures that correlate with resistance. First, we visualize the high-dimensional data in a two-dimensional subspace using principal component analysis (PCA), an unsupervised machine learning technique (i.e. only using our data matrix X). Next, to find a subset of important features, we perform feature selection using Least Absolute Shrinkage and Selection Operator (LASSO). Finally, we build a partial least square discriminant analysis (PLS-DA) model to assess the relation between the features (X) and resistance (Y).
The PCA and PLS-DA will be performed using the package 'ropls'. A more detailed tutorial for the package can be found here: https://www.bioconductor.org/packages/devel/bioc/vignettes/ropls/inst/doc/ropls-vignette.html.
### Analysis on the whole data set
#### Principal component analysis (PCA)
Dimension reduction techniques are applied to transform high-dimensional data in order to visualize it in a lower dimensional subspace.
For this, a popular approach is PCA, which can be performed using the function 'opls' from the package 'ropls'.
```{r PCA}
# Task 1: Perform PCA on the z-scored data
#(remember that you can check the documentation using '?opls')
#(remember to load the libraries)
# [..your code...]
# This will generate a figure with 4 plots and a result object
```
We will pay more attention to the two plots on the bottom.
The obtain a nicer score plot (including a color coding of y), we can use the internal ropls 'plot' function (use 'ropls::plot').
```{r PCA score plot}
# Task 2: Plot only the score plot with color coded y
# (you can find the documentation for opls plot using '?plot.opls')
# [..your code...]
```
There is a lot of overlap between RSTR and LTB, however, some of the difference appears to be explained by the second principal component (PC2). So, let's have a closer look at the loadings plot, which can be done with the same plotting routine as the scores plot adjusting the plotting options.
```{r PCA laoding plot}
# Task 3: Plot only the loading plot of the PCA
# [..your code...]
```
This plot shows how individual features influence the principal components, highlighting the features that affect the components the most. The coordinates of the features on the axis shows the weight by which they contribute to the principal component. For example, a feature (ppd.G2S1) which is on the very top of the loadings plot contributes a lot to the second principal component, which we associated with LTB. However, the large number of features and the weak separation between LTB and RSTR makes this figure a challenge to interpret.
#### Feature selection
To select the minimal number of individual features that capture the overall variation among the RSTR and LTB, we use penalized logistic regression with cross validation implemented in the function 'cv.glmnet' from the 'glmnet' package. We extract the set of features which have a non-zero coefficient.
```{r feature selection}
# Task 4: Perform feature selection with 10 fold cross validation (CV) using the 'cv.glmnet' function with squared-error as metric and binomial response type.
#(the LASSO arguments are described in ?glmnet and the cross validation arguments are described in ?cv.glmnet)
#(a more detailed description can be found at https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)
set.seed(1234) # Fix random seed for demonstration purposes
# [..your code...]
#You can now extract the coefficients from the trained model using the 'coef' function (it is described in ?coef.glmnet)
#By setting 's' to 'lambda.min' you will get coefficients for the features that was optimized using CV.
#Remove intercept and extract the feature names
#Store data from the selected features in a variable for further analysis
```
#### PCA on feature selected data
Repeat the PCA using only the selected features. What do you observe?
```{r PCA sel}
# Task 5: Run PCA using data from the selected features
# [..your code...]
```
Now that we focus on features that are able to explain the difference between RSTR and LTB there is much better separation between the groups in PCA, mostly on PC2.
Next, we can visualize the contribution of each features on distinguished PC2.
```{r PCA feature importance visualization}
PCA.loading <- res_pca2@loadingMN
df <- data.frame(VIP = PCA.loading[, 2], features = rownames(PCA.loading))
ggplot(data = df,
aes(x = reorder(features, abs(VIP)), y = VIP)) +
geom_bar(stat = "identity", color = "black") +
theme_classic() +
xlab("") +
ylab("PC2 loading scores") +
ggtitle("Feature Importance in PC2") +
coord_flip()
```
From the loading score in PC2, we can see that the CFP10/ESAT6 IG1 level is among the top features enriched among 'resisters'.
#### Partial least squares discriminant analysis (PLS-DA)
PCA is an unsupervised technique which does not include the y-value, i.e. resistance. We can include this information in the modeling by using a supervised technique, namely, partial least square discriminant analysis (PLS-DA). In contrast to PCA, it does not find the directions of highest variance of the data (X), but the directions of highest co-variance of features (X) and outcome (y). This model can also be obtained by the function 'opls' of the 'ropls' package.
Note: For the moment, do not pay attention to the output visualization. We will look closer at the corresponding parts in the following.
```{r PLS-DA1}
# Task 6: Run PLS-DA on data from the selected features
# [..your code...]
```
For visualization purposes, set the number of latent variables (components) to two and only show the score and loadings plots.
```{r PLS-DA2}
# Task 7: Set the number of components to 2 and only show the score and loadings plots.
# [..your code...]
```
From the bottom right plot, we observe that the model clearly separates the samples from RSTR and LTB.
We always need to validate the model's robustness on the dataset. Here, we simply divide the whole samples into two groups with the ratio of 50%. Then, the model is trained in one subset (train set) and tested on antother subset (validation set). More complex evaluation methods could be applied like k-fold cross-validation or leave-one-out cross-validation, which will not be covered here.
Warning: This only is a small part of the generally required model validation, since it does not include a validation of the feature selection and is performed only once (i.e., for one realization of the data split)! See Ackermann, Das, et al., Route of immunization defines multiple mechanisms of vaccine-mediated protection against SIV, Nature Medicine, 2019 for an example and more detailed description of how to validate the model (in a permutation test setting).
```{r PLS-DA2 Evaluation}
res_plsda_train <- opls(X_sel, y, predI = 2, subset = "odd") # here "odd" means evenly divide the whole samples into two groups.
ropls::plot(res_plsda_train,
typeVc = "x-score")
ropls::plot(res_plsda_train,
typeVc = "x-loading")
# performance on the train subset
trainVi <- getSubsetVi(res_plsda_train)
table(y[trainVi], fitted(res_plsda_train))
# performance on the test subset
table(y[-trainVi],
predict(res_plsda_train, X_sel[-trainVi, ]))
```
From the scores plot, we still can see that the model performs good on the classification. Furthermore, from the confusion matrix in the validation subset, we see that the model gives a good prediction. In whole, we are more confident that the model is robust on the classification of LTB and RSTR in the dataset.
##### VIP scores
To assess the importance of the features in the PLS-DA model, the variable importance in projection (VIP) score can be used. The VIP scores can be obtained using the function 'getVipVn' of the package 'ropls'.
For this, you can use the visualization of the correlation coefficients as a template.
```{r VIP scores subset}
# Task 8: Get VIP scores and visualize them
# (to vizualize the scores use ggplot or similar)
# [..your code...]
```
From the VIP scores, we observe that CFP10/ESAT6 IgG1 and PPD-specific features are important to distinguish the LTB and RSTR groups.
#### Co-correlates
The regularization for feature selection picks features which are predictive for the outcome. However, also other features which are highly correlated with the chosen features could be used instead of the selected features. To take this into account it is important to assess the co-correlates, i.e., the features which correlate with the selected features.
Here, we could visualize the correlation among different features using two ways: a correlation matrix and a correlation network. Specifically, only the selected features and features that have an absolute correlation coefficient greater than 0.7 to one of the selected features are visualized.
1. One way of (visually) assessing the correlation matrix is with a correlation heatmap.
```{r correlation heatmap}
# Bonus task: Find other ways to explore the correlations
library(RColorBrewer)
corrMat <- cor(X, method = "spearman")
# heatmap for full correlation matrix
pheatmap(corrMat, cluster_rows = FALSE, cluster_cols = FALSE,
color = colorRampPalette(brewer.pal(n = 10, name = "RdBu"))(100)) # fix color scale to make interpretation easier
# keep only selected features and highly correlated features
colMax <- function(X) apply(X, 2, max)
corrMat <- corrMat[selected_features, ]
corrMat <- corrMat[, which(colMax(abs(corrMat)) < 1 & colMax(abs(corrMat)) > 0.7)]
# heatmap for reduced correlation matrix
pheatmap(corrMat, cluster_rows = FALSE, cluster_cols = FALSE,
color = colorRampPalette(brewer.pal(n = 10, name = "RdBu"))(100),
breaks = seq(-1, 1, length = 100))
```
From this correlation heatmap, we could idenfity several groups of features with similar characteritics. For example, we see that ppd.G2S1 is negativly correlated with the ppd.IgG1:IgG2 ratio and ppd.IgG1, while CFP10/ESAT6.IgG1 is positively correlated with the ppd.IgG1:IgG2 ratio and ppd.IgG1.
2. Another way of assessing the co-correlates is in network form with nodes representing the features and edges representing the correlation coefficient. Networks can be plotted in R using 'ggraph'.
```{r cocorrelate network, fig.width = 10}
library(tidyverse)
library(corrr)
library(igraph)
library(ggraph)
library(ggrepel)
library(RColorBrewer)
# include meta data in analysis
Xmeta <- data[,2:dim(data)[2]]
tidyCors <- Xmeta %>%
correlate(method = "spearman") %>%
stretch()
graphCors <- tidyCors %>%
filter(abs(r) > 0.6 & (is.element(x, selected_features) | is.element(y, selected_features))) %>%
graph_from_data_frame(directed = FALSE)
layout <- create_layout(graphCors, layout = 'igraph', algorithm = 'nicely')
pltGraph <- ggraph(layout) +
geom_edge_link(aes(color = r), edge_width = 2) +
guides(edge_alpha = "none", edge_width = "none") +
scale_edge_colour_gradientn(limits = c(-1, 1),
colors = colorRampPalette(brewer.pal(n = 10, name = "RdBu"))(100)) +
geom_node_point(shape = 21, color = "gray", fill = "white", size = 8, stroke = 0.5) +
geom_node_text(aes(label = name), size = 3, point.padding = NA, box.padding = 0, force = 0.1, repel = T) +
theme(aspect.ratio = 1) +
theme_graph(background = "white")
plt <- plot(pltGraph)
```
Until now, we have covered the whole analysis pipeline including data inspection, feature reduction (feature extraction/PCA, feature selection/LASSO) and supervised PLS-DA modeling with feature importance interpretation/correlation validation on our demonstrated dataset.
Keep in mind that besides the above described analysis steps, data collection and pre-processing including balancing strategies for confounding factors, data imputation and model validation are equally important to make you success, even though these steps are not covered in this tutorial.
Enjoy your analysis trip and good luck on your own project!