---
title: "Systems Serology Machine Learning Workshop Easy"
output:
  html_document:
    df_print: paged
---
## Getting started
Install the following packages before the workshop:
* readxl: data import
* ggpubr: statistical test
* corrr: correlations
* ropls: multivariate analysis
* glmnet: feature selection
* DMwR: data imputation
* pheatmap: visualization
* ggplot2: visualization
* ggrepel: visualization
* RColorBrewer: color schemes for visualization
* igraph: network analysis
* ggraph: network visualization
* tidyverse: unnecessarily fancy variable handling
## Data import and inspection
The data is taken from the following publication: Lu et al., IFN-γ-independent immune markers of Mycobacterium tuberculosis exposure, Nature Medicine (2019).
1) First, we import the data using the function 'read_excel'.
Note: To learn more about individual functions and their usage, use the command ?function_name, e.g., ?read_excel.
2) To visualize the data we plot a heatmap using 'pheatmap'.
3) We need to define X (the measurements) and y (the output we want to predict) for the further analysis.
The column names (colnames) and row names (rownames) of X should be the features and sample IDs, respectively.
4) There are missing values in the dataset. We impute them with k-nearest-neighbors imputation using the package 'DMwR'. It is very important to evaluate how the imputation affects the model; however, this is beyond the scope of this workshop.
5) The measurements are on different scales; to ensure comparability across features, we standardize the data (Z-score).
```{r data import, fig.width = 20}
library(readxl)
library(pheatmap)
library(DMwR)
# Import the data from the excel sheet
data = read_excel("RSTR_data_NatureMedicine.xlsx")
head(data)
# Define X and y
X <- as.matrix(data[, 5:dim(data)[2]]) #Columns 1:4 contain demographics and outcome
rownames(X) <- paste0("s", seq(nrow(X))) #generates sample ids as 's1', 's2', 's3', ...
y <- factor(data$dx)
names(y) <- rownames(X)
# We make a heatmap of the raw data
pheatmap(X)
# Data imputation for the missing values
X <- knnImputation(X)
y <- y[match(rownames(X), names(y))] #ensures that the order of samples in y is the same as in X
# Z-score the data to bring the measurements to the same scale (mean 0 and standard deviation 1 across samples)
X_zscore <- scale(X, center = TRUE, scale = TRUE)
# Display heatmap of the z-scored data indicating phenotypic outcome
annotation_row = data.frame(phenotype = y)
rownames(annotation_row) <- rownames(X)
pheatmap(X_zscore, annotation_row = annotation_row)
```
## Description of the data
Response variable:
The phenotype we want to predict is latent Mtb infection (LTB) versus resistance (RSTR). The classification is based on the clinical diagnosis.
Specifically, LTB and RSTR are defined in the paper as follows:
* LTB: individuals diagnosed based on evidence of immune sensitization to Mycobacterium tuberculosis (Mtb) antigens in the absence of clinical symptoms of tuberculosis (TB) or direct microbiologic evidence of disease.
* RSTR: a population of household contacts who were highly exposed to Mtb yet remained persistently negative in interferon-γ (IFN-γ) release assays (IGRA) and tuberculin skin tests (TST) over an average of 9.5 years of follow-up per individual.
Features:
Different assays/tests are performed together with different antigens. The nomenclature is as follows:
[antigen abbreviation].[test abbreviation].[metric]
List of antigen abbreviations:
ppd - purified protein derivative, proteins extracted from Mtb used in diagnosis
ag85 - antigen 85, Mtb protein
esat6 and cfp10 - Mtb proteins
lam - lipoarabinomannan from Mtb cell-wall
flu - Influenza virus peptides (respiratory pathogen)
pneumo - S. pneumoniae peptides (respiratory pathogen)
cmv - cytomegalovirus (non-respiratory pathogen)
vzv - varicella zoster virus (non-respiratory pathogen)
rubella - rubella virus (non-respiratory pathogen)
tetanus - tetanus toxoid (non-respiratory pathogen)
groes - chaperone protein GroES
hspx - α-crystallin (HspX)
cl - cardiolipin
ps - phosphatidylserine
b2gp1 - recombinant human apolipoprotein H/b2GP1
gp140 - HIV glycoprotein gp140
rsv - Respiratory syncytial virus
List of test abbreviations:
CD107a - Functional test of NK cell degranulation (CD107a surface expression)
IFNg - Functional test of NK cell activation (IFN-γ secretion)
MIP1b - Functional test of NK cell activation (MIP-1β secretion)
ADNP - Functional test of Ab-dependent neutrophil phagocytosis
ADCP - Functional test of Ab-dependent cellular phagocytosis
IgG.elisa - IgG ELISA
IgG, IgM, IgG1, IgG2, IgG3, IgG4, IgA1, IgA2 - Luminex antibody subclasses and isotypes
IgG1_2 - ratio IgG1:IgG2
IgG1_3 - ratio IgG1:IgG3
IgG1_4 - ratio IgG1:IgG4
IgG2_3 - ratio IgG2:IgG3
IgG3_4 - ratio IgG3:IgG4
FcR.3AF - luminex FcR allele variant (3AF)
FcR.3AV - luminex FcR allele variant (3AV)
FcR.2AH - luminex FcR allele variant (2AH)
FcR.2AR - luminex FcR allele variant (2AR)
G0, G0*, G0B, G0F, G0FB,... - IgG glycans
List of metric abbreviations:
AUC - area under the curve calculated from median fluorescence intensities (MFIs) from three dilutions
5, 10, 100, 1000 - dilutions
Meta data:
The dataset also includes metadata about the subjects including age, sex, and BMI that we do not use in the analysis.
## Univariate analysis
In the following, we visualize the IgG1 values for the individual classes across antigens. We annotate the significance using the Wilcoxon rank-sum test via the library 'ggpubr'. Note: here, the p-values are not corrected for multiple hypothesis testing!
```{r univariate}
library(ggplot2)
library(ggpubr)
# Extract indices of the IgG1 features measured at the 1:1000 dilution
indFeatures <- which(grepl("IgG1.1000", colnames(X_zscore)))
# Reformat the data for the following plotting procedure
dfBox <- reshape2::melt(cbind(as.data.frame(X_zscore[,indFeatures]), class = y), id.var = "class")
colnames(dfBox) <- c("class", "features", "value")
ggplot(dfBox, aes(x = features, y = value, fill = class)) +
stat_boxplot(position = position_dodge(width = 0.75),
geom = 'errorbar', width = 0.2) + # boxplot
geom_boxplot(outlier.shape = NA) + # outliers are plotted below
geom_point(position = position_jitterdodge(), shape = 21, size = 1,
aes(fill = class), color = "black") + # add points for individual measurements
ylab("z-scored value") +
theme_classic() + # visualization style
theme(axis.text.x = element_text(angle = 40, hjust = 1)) + # rotate x-labels
stat_compare_means(aes(group = class), label = "p.signif", method = "wilcox.test") # comparison of groups using the Wilcoxon test
```
## Multivariate analysis
In this part of the workshop, we analyze the data in a multivariate fashion to find multivariate signatures that correlate with resistance. First, we visualize the high-dimensional data in a two-dimensional subspace using principal component analysis (PCA), an unsupervised machine learning technique (i.e., it uses only our data matrix X). Next, to find a subset of important features, we perform feature selection using the Least Absolute Shrinkage and Selection Operator (LASSO). Finally, we build a partial least squares discriminant analysis (PLS-DA) model to assess the relation between the features (X) and resistance (y).
The PCA and PLS-DA will be performed using the package 'ropls'. A more detailed tutorial for the package can be found here: https://www.bioconductor.org/packages/devel/bioc/vignettes/ropls/inst/doc/ropls-vignette.html.
### Analysis on the whole data set
#### Principal component analysis (PCA)
Dimension reduction techniques are applied to transform high-dimensional data in order to visualize it in a lower dimensional subspace.
For this, a popular approach is PCA, which can be performed using the function 'opls' from the package 'ropls'.
```{r PCA}
# Task 1: Perform PCA on the z-scored data
#(remember that you can always check the documentation of a function e.g. using '?opls')
# Step 1, load the 'ropls' library, the command is library([name]), where [name] is the name of the library
# [... type the command below here...]
# Step 2 perform the PCA on the z-scored data and store the results in a variable named 'res_pca'
# the command for running PCA with opls is: [result variable] <- opls([data]) the data you want to use is stored in the X_zscore variable
# [... type the command below here...]
# This will generate a figure with 4 plots and a result object
```
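One possible solution for Task 1, as a sketch (loading 'ropls' and running PCA on the z-scored data, stored in 'res_pca' as the instructions specify):

```{r task1-solution}
# Load the multivariate analysis library and run PCA on the z-scored data
library(ropls)
res_pca <- opls(X_zscore)
```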
We will pay more attention to the two plots on the bottom.
To obtain a nicer score plot (including a color coding of y), we can use the internal ropls 'plot' function (use 'ropls::plot').
```{r PCA score plot}
# Task 2: Plot only the score plot with color coded y
# (you can find the documentation for opls plot using '?plot.opls')
# Step 1, plot the result from above (that you stored in 'res_pca') using the command, ropls::plot([result], typeVc = "x-score", parAsColFcVn = [grouping]) the variable 'Y' contains the information about which group each sample belongs to. The ' typeVc = "x-score" ' part tells ropls to only show the score plot.
# [... type the command below here...]
```
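One possible solution for Task 2 (a sketch; 'res_pca' is the PCA result from Task 1 and the grouping is stored in 'y'):

```{r task2-solution}
# Score plot only, with samples colored by their class label
ropls::plot(res_pca, typeVc = "x-score", parAsColFcVn = y)
```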
There is a lot of overlap between RSTR and LTB; however, some of the difference appears to be explained by the second principal component (PC2). So let's have a closer look at the loadings plot, which uses the same plotting routine as the scores plot with adjusted plotting options.
```{r PCA loading plot}
# Task 3: Plot only the loading plot of the PCA
# Step 1, do the same as in task 2, but now with 'x-loading' instead of 'x-score' to display the loading
# [... type the command below here...]
```
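One possible solution for Task 3 (the same call as in Task 2, with 'x-loading'):

```{r task3-solution}
# Loading plot showing how each feature contributes to the components
ropls::plot(res_pca, typeVc = "x-loading")
```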
This plot shows how individual features influence the principal components, highlighting the features that affect the components the most. The coordinates of the features on the axes show the weight by which they contribute to the respective principal component. For example, a feature at the very top of the loadings plot (ppd.G2S1) contributes strongly to the second principal component, which we associated with LTB. However, the large number of features and the weak separation between LTB and RSTR make this figure a challenge to interpret.
#### Feature selection
To select the minimal number of individual features that capture the overall variation between the RSTR and LTB groups, we use penalized logistic regression with cross-validation, implemented in the function 'cv.glmnet' from the 'glmnet' package. We extract the set of features with non-zero coefficients.
```{r feature selection}
library(glmnet)
# Task 4: Perform feature selection with 10 fold cross validation (CV) using the 'cv.glmnet' function with squared-error as metric and binomial response type.
#(the LASSO arguments are further described in ?glmnet and the cross validation arguments are described in ?cv.glmnet)
#(a more detailed description can be found at https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)
set.seed(1234) # Fix random seed for demonstration purposes (this ensures that we all will get the same result)
# Step 1, run LASSO with 10 fold CV by using the command cv.glmnet([data], [grouping], type.measure = "mse", family = "binomial", nfolds = [number of folds]) store results in a variable named 'res_lasso'. Here 'type.measure = "mse"' indicates that we use mean squared-error as metric and 'family = "binomial"' that we use a binomial model, i.e. logistic regression.
# [... type the command below here...]
# Step 2, extract the coefficients from the trained model using the command 'coef([output from LASSO], s = "lambda.min")'. The 's = "lambda.min"' part tells R to extract the coefficients at the regularization parameter that was optimized using CV. Store the results in a variable 'c' for further processing
#(this function is further described in ?coef.glmnet)
# [... type the command below here...]
# Step 3, figure out which of the coefficients were not 0. We do this with the command 'which(c[,1] != 0)', where 'c[,1]' indicates that we want the coefficients from the first column of c (there is only one). Store the result in a variable named 'inds'
# [... type the command below here...]
#Step 4, get the names of the features using the names([variable]) command store the results in a variable called 'selected_features'
# [... type the command below here...]
#Step 5, take a look at the features that were selected using the print([variable name]) command
# [... type the command below here...]
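# One possible solution for Steps 1-5, as a sketch (variable names follow the instructions above):
res_lasso <- cv.glmnet(X_zscore, y, type.measure = "mse", family = "binomial", nfolds = 10)
c <- coef(res_lasso, s = "lambda.min") # coefficients at the CV-optimal lambda
inds <- which(c[,1] != 0)              # indices of the non-zero coefficients
selected_features <- names(inds)       # feature names (still includes the intercept)
print(selected_features)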
#Step 6, the features include the intercept term, we can remove it by keeping all elements in the variable apart from the first, using the following command:
selected_features <- selected_features[2:length(selected_features)]
#Step 7 Store data from the selected features in a variable for further analysis using the following command:
X_sel <- X_zscore[, selected_features]
```
#### PCA on feature selected data
Repeat the PCA using only the selected features. What do you observe?
```{r PCA sel}
# Task 5: Run PCA using data from the selected features
# Step 1, run PCA as you did before using opls([data]) but this time using only the feature selected data in X_sel, store results in a variable named 'res_pca2'
# [... type the command below here...]
# Step 2, plot only the score plot as in task 2 using the ropls::plot command
# [... type the command below here...]
#Step 3, plot only the loadings as in task 3 using the ropls::plot command
# [... type the command below here...]
```
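One possible solution for Task 5 (a sketch; 'X_sel' holds the feature-selected data from Task 4):

```{r task5-solution}
# PCA on the selected features only, then score and loading plots
res_pca2 <- opls(X_sel)
ropls::plot(res_pca2, typeVc = "x-score", parAsColFcVn = y)
ropls::plot(res_pca2, typeVc = "x-loading")
```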
Now that we focus on features that are able to explain the difference between RSTR and LTB, there is much better separation between the groups in the PCA, mostly along PC2.
Next, we can visualize the contribution of each feature to PC2, which distinguishes the groups.
```{r PCA feature importance visualization}
PCA.loading <- res_pca2@loadingMN
df <- data.frame(VIP = PCA.loading[, 2], features = rownames(PCA.loading))
ggplot(data = df,
aes(x = reorder(features, abs(VIP)), y = VIP)) +
geom_bar(stat = "identity", color = "black") +
theme_classic() +
xlab("") +
ylab("PC2 loading scores") +
ggtitle("Feature Importance in PC2") +
coord_flip()
```
From the loading scores on PC2, we can see that the CFP10/ESAT6 IgG1 level is among the top features enriched among 'resisters'.
#### Partial least squares discriminant analysis (PLS-DA)
PCA is an unsupervised technique that does not include the y-value, i.e., resistance. We can include this information in the modeling by using a supervised technique, namely partial least squares discriminant analysis (PLS-DA). In contrast to PCA, it does not find the directions of highest variance of the data (X), but the directions of highest covariance between the features (X) and the outcome (y). This model can also be obtained with the function 'opls' of the 'ropls' package.
Note: For the moment, do not pay attention to the output visualization. We will look closer at the corresponding parts in the following.
```{r PLS-DA1}
# Task 6: Run PLS-DA on data from the selected features
# Step 1. PLS-DA uses information from Y in the model construction, the command is opls([x data], [y data])
# [... type the command below here...]
```
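One possible solution for Task 6, as a sketch (the task does not prescribe a result variable; the name 'res_plsda' is our choice):

```{r task6-solution}
# Supervised model: PLS-DA of the selected features against the class labels
res_plsda <- opls(X_sel, y)
```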
For visualization purposes, set the number of latent variables (components) to two and only show the score and loadings plots.
```{r PLS-DA2}
# Task 7: Set the number of components to 2 and only show the score and loadings plots.
# Step 1, run the following command opls([x data], [y data], predI = [number of components]), store results in res_plsda_vis
# [... type the command below here...]
# Step 2, plot only the score plot as in task 2 using the ropls::plot command
# [... type the command below here...]
#Step 3, plot only the loadings as in task 3 using the ropls::plot command
# [... type the command below here...]
```
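One possible solution for Task 7 (a sketch; the result is stored in 'res_plsda_vis' as the instructions specify):

```{r task7-solution}
# Two-component PLS-DA, then score and loading plots
res_plsda_vis <- opls(X_sel, y, predI = 2)
ropls::plot(res_plsda_vis, typeVc = "x-score")
ropls::plot(res_plsda_vis, typeVc = "x-loading")
```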
From the bottom-right plot, we observe that the model clearly separates the samples from RSTR and LTB.
We always need to validate the model's robustness. Here, we simply split the samples evenly into two halves: the model is trained on one subset (training set) and tested on the other (validation set). More complex evaluation schemes, such as k-fold cross-validation or leave-one-out cross-validation, could be applied but are not covered here.
Warning: This only is a small part of the generally required model validation, since it does not include a validation of the feature selection and is performed only once (i.e., for one realization of the data split)! See Ackermann, Das, et al., Route of immunization defines multiple mechanisms of vaccine-mediated protection against SIV, Nature Medicine, 2019 for an example and more detailed description of how to validate the model (in a permutation test setting).
```{r PLS-DA2 validation}
res_plsda_train <- opls(X_sel, y, predI = 2, subset = "odd") #subset = "odd" trains on the odd-indexed samples, leaving the even-indexed samples for validation
ropls::plot(res_plsda_train,
typeVc = "x-score")
ropls::plot(res_plsda_train,
typeVc = "x-loading")
#the performance on the train subset.
trainVi <- getSubsetVi(res_plsda_train)
table(y[trainVi], fitted(res_plsda_train))
#compute the performance on the test subset
table(y[-trainVi],
predict(res_plsda_train, X_sel[-trainVi, ]))
```
From the scores plot, we can still see that the model classifies well. Furthermore, the confusion matrix on the validation subset shows that the model gives a good prediction. Overall, this makes us more confident that the model robustly classifies LTB and RSTR in this dataset.
##### VIP scores
To assess the importance of the features in the PLS-DA model, the variable importance in projection (VIP) score can be used. The VIP scores can be obtained using the function 'getVipVn' of the package 'ropls'.
As a template for the visualization, you can reuse the bar plot of the PCA loadings from above.
```{r VIP scores subset}
library(ggplot2)
# Task 8: Get VIP scores and visualize them
# Step 1, get the VIP scores using the command 'getVipVn([plsda model])' and store them in a variable named 'VIP'
# [..your code...]
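# One possible solution (sketch): extract the VIP scores from the two-component PLS-DA model of Task 7
VIP <- getVipVn(res_plsda_vis)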
# Step 2, reformat the variable to prepare it for plotting
df <- data.frame(VIP = VIP, features = names(VIP))
# Step 3, we visualize the results using ggplot
ggplot(data = df,
aes(x = reorder(features, abs(VIP)), y = VIP)) +
geom_bar(stat = "identity", color = "black") +
theme_classic() +
xlab("") +
ylab("VIP scores") +
theme(axis.text.x = element_text(angle = 90)) +
coord_flip()
```
From the VIP scores, we observe that CFP10/ESAT6 IgG1 and PPD-specific features are important to distinguish the LTB and RSTR groups.
#### Co-correlates
The regularization in the feature selection picks features that are predictive of the outcome. However, other features that are highly correlated with the chosen features could be used in their place. To take this into account, it is important to assess the co-correlates, i.e., the features that correlate with the selected features.
Here, we visualize the correlation among features in two ways: a correlation matrix and a correlation network. In both, only the selected features and the features with a high absolute correlation coefficient to one of the selected features (greater than 0.7 for the heatmap, 0.6 for the network) are shown.
1. One way of (visually) assessing the correlation matrix is with a correlation heatmap.
```{r correlation heatmap}
library(RColorBrewer)
corrMat <- cor(X, method = "spearman")
# heatmap for full correlation matrix
pheatmap(corrMat, cluster_rows = FALSE, cluster_cols = FALSE,
color = colorRampPalette(brewer.pal(n = 10, name = "RdBu"))(100)) # fix color scale to make interpretation easier
# keep only selected features and highly correlated features
colMax <- function(X) apply(X, 2, max)
corrMat <- corrMat[selected_features, ]
corrMat <- corrMat[, which(colMax(abs(corrMat)) < 1 & colMax(abs(corrMat)) > 0.7)]
# heatmap for reduced correlation matrix
pheatmap(corrMat, cluster_rows = FALSE, cluster_cols = FALSE,
color = colorRampPalette(brewer.pal(n = 10, name = "RdBu"))(100),
breaks = seq(-1, 1, length = 100))
```
From this correlation heatmap, we could identify several groups of features with similar characteristics. For example, we see that ppd.G2S1 is negatively correlated with the ppd.IgG1:IgG2 ratio and ppd.IgG1, while CFP10/ESAT6.IgG1 is positively correlated with the ppd.IgG1:IgG2 ratio and ppd.IgG1.
2. Another way of assessing the co-correlates is in network form with nodes representing the features and edges representing the correlation coefficient. Networks can be plotted in R using 'ggraph'.
```{r cocorrelate network, fig.width = 10}
library(tidyverse)
library(corrr)
library(igraph)
library(ggraph)
library(ggrepel)
library(RColorBrewer)
# include meta data in analysis
Xmeta <- data[,2:dim(data)[2]]
tidyCors <- Xmeta %>%
correlate(method = "spearman") %>%
stretch()
graphCors <- tidyCors %>%
filter(abs(r) > 0.6 & (is.element(x, selected_features) | is.element(y, selected_features))) %>%
graph_from_data_frame(directed = FALSE)
layout <- create_layout(graphCors, layout = 'igraph', algorithm = 'nicely')
pltGraph <- ggraph(layout) +
geom_edge_link(aes(color = r), edge_width = 2) +
guides(edge_alpha = "none", edge_width = "none") +
scale_edge_colour_gradientn(limits = c(-1, 1),
colors = colorRampPalette(brewer.pal(n = 10, name = "RdBu"))(100)) +
geom_node_point(shape = 21, color = "gray", fill = "white", size = 8, stroke = 0.5) +
geom_node_text(aes(label = name), size = 3, point.padding = NA, box.padding = 0, force = 0.1, repel = T) +
theme(aspect.ratio = 1) +
theme_graph(background = "white")
plt <- plot(pltGraph)
```
So far, we have covered the whole analysis pipeline: data inspection, feature reduction (feature extraction/PCA and feature selection/LASSO), and supervised PLS-DA modeling with feature-importance interpretation and co-correlate analysis on the demonstration dataset.
Keep in mind that, besides the analysis steps described above, data collection and pre-processing (including balancing strategies for confounding factors), data imputation, and model validation are equally important for success, even though these steps are not covered in this tutorial.
Enjoy your analysis trip and good luck on your own project!