# Comparing the integration results {#sec-comparison}
```{r}
#| child: "_setup.qmd"
```
```{r loading-packages}
#| include: false
library(targets)
library(moiraine)
library(purrr)
library(dplyr)
library(ggplot2)
```
```{r setup-visible}
#| eval: false
library(targets)
library(moiraine)
## For data-frames manipulation
library(dplyr)
## For colour palettes
library(ggplot2)
```
As different integration methods use different statistical approaches and make different assumptions about the data, it can be very informative to apply them to the same multi-omics dataset and compare the results. In this chapter, we will see how the `moiraine` package can be used to compare the results from different integration methods.
As a reminder, here is what the `_targets.R` script should look like so far:
<details>
<summary>`_targets.R` script</summary>
```{r previous-vignettes}
#| echo: false
#| results: asis
knitr::current_input() |>
get_previous_content() |>
cat()
```
</details>
## Introduction
As explained in @sec-interpretation-dim-reduction, the integration tools considered in `moiraine` all perform dimension reduction of the omics datasets, and therefore all construct latent dimensions which can be described in terms of their features weight and their samples score. We can thus compare the results of different integration methods through their latent dimensions, by comparing either the latent dimensions' samples score, or their features weight. When comparing the samples score, we are answering the question: *are the different methods capturing the same trends from the datasets?* By comparing the features weight, we are instead answering a slightly different question: *are the different methods highlighting similar sets of features as driving variation in the datasets?* Of course, these comparisons are related, as two latent dimensions that identify very different trends in the data will most likely give high importance to very different sets of features.
Note that, although this chapter mainly considers a comparison of different integration methods, we could also compare the results of the same integration method run with different pre-processing options (e.g. to assess the impact of dataset normalisation or pre-filtering on the integration results). To illustrate this, we will run a second MOFA analysis, this time using the datasets that were pre-filtered to retain the most variable features (@sec-prefiltering-unsupervised), as opposed to the supervised pre-filtering results that we have used in the main analysis (@sec-prefiltering-supervised).
::: {.targets-chunk}
```{targets mofa-unsupervised}
list(
## Creating the input object for the MOFA pipeline
## using the unsupervised preselection results
tar_target(
mofa_unsupervised_input,
get_input_mofa(
mo_presel_unsupervised,
options_list = list(
data_options = list(scale_views = TRUE),
model_options = list(
likelihoods = c(
"snps" = "poisson",
"rnaseq" = "gaussian",
"metabolome" = "gaussian")
),
training_options = list(seed = 72)
),
only_common_samples = FALSE
)
),
## Training the model with the MOFA algorithm
tar_target(
mofa_unsupervised_trained,
run_mofa(
mofa_unsupervised_input,
save_data = TRUE,
use_basilisk = TRUE
)
),
## Formatting MOFA output
tar_target(
mofa_unsupervised_output,
get_output(mofa_unsupervised_trained)
)
)
```
:::
## Extracting integration results
The first step in order to compare the results from different methods is to convert them to a standardised format with the `get_output()` function, which we did in @sec-interpretation. The function returns an `output_dimension_reduction` object which contains information about the latent dimensions' features weight and samples score, as well as a summary of the percentage of variance explained for each dataset. The `output_dimension_reduction` object created also has an attribute called `method`, which contains the name of the integration method used:
```{r show-output-method}
tar_load(diablo_output)
attr(diablo_output, "method")
```
To facilitate comparisons, we will combine these objects into a list:
::: {.targets-chunk}
```{targets output-list}
tar_target(
output_list,
list(spls_output, so2pls_output, mofa_output, diablo_output)
)
```
:::
This list will be the input of most of the functions that we will showcase in this chapter.
```{r read-output-list}
tar_load(output_list)
```
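As a quick check, we can inspect the method name stored in each element of this list (a small sketch using `purrr::map_chr()`, loaded above); these are the labels that the comparison functions will use in the plots:
```{r output-list-methods}
#| eval: false
## Method name stored in each output object
map_chr(output_list, attr, "method")
```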
Note that we did not name the elements of the list we just created. When comparing results from different integration methods, the comparison functions implemented in `moiraine` automatically extract the name of the method from each object (as shown above) and use it as labels in the plots. However, when we want to compare several results from the same integration method, we need to give unique names to the elements of the output list, e.g.:
::: {.targets-chunk}
```{targets output-list-mofa}
tar_target(
output_list_mofa,
list(
"MOFA (supervised pref.)" = mofa_output,
"MOFA (unsupervised pref.)" = mofa_unsupervised_output
)
)
```
:::
```{r read-output-list-mofa}
tar_load(output_list_mofa)
```
## Clustered correlation heatmaps
We can visualise the correlation between the samples score and features weight of the different latent dimensions constructed by the integration methods as a heatmap, using the `comparison_heatmap_corr()` function. The main input parameter of the function is a list of `output_dimension_reduction` objects, which we constructed in the previous section:
```{r comparison-heatmap-corr}
#| fig.width: 8
#| fig.height: 7.5
comparison_heatmap_corr(output_list)
```
The function generates two half-heatmaps. The heatmap on the left visualises the correlation between the features weight of the different latent dimensions; the one on the right shows the correlation between their samples score. Since correlation matrices are symmetric, only one triangle of each matrix is represented. The rows and columns of the heatmaps each correspond to one of the latent dimensions generated by one of the integration methods. Their names are abbreviated (C stands for component, F for factor, JC for joint component, RSC for rnaseq-specific component and MSC for metabolome-specific component). The method through which each latent dimension was generated is indicated next to its name as a coloured annotation. In each heatmap, the rows and columns have been ordered according to a clustering performed on the correlation matrix, so that the most similar latent dimensions (in terms of samples score or features weight) are placed next to each other.
By default, the function represents all of the latent dimensions generated by each method, but the `latent_dimensions` parameter allows us to select only the ones we are interested in comparing. It accepts a named list, where each element gives, for one integration method, the latent dimensions to use. The names of the list must correspond to the method names. For example, we will look at all latent dimensions from sPLS and DIABLO, but restrict MOFA to its first four factors, and sO2PLS to the first joint component, the first two metabolome-specific components and the first rnaseq-specific component:
```{r comparison-heatmap-corr-subset}
#| fig.width: 8
#| fig.height: 7.5
comparison_heatmap_corr(
output_list,
latent_dimensions = list(
"sO2PLS" = c(
"joint component 1",
paste("metabolome specific component", 1:2),
"rnaseq specific component 1"
),
"MOFA" = paste("Factor", 1:4)
)
)
```
From the heatmaps, we can see that some latent dimensions constructed by the different methods seem to capture similar trends in the data. For example, MOFA factor 1, sPLS component 1, DIABLO component 1 and sO2PLS joint component 1 are all strongly correlated in terms of their samples score. Their correlation in terms of features weight is a bit lower, which is due to the fact that some methods perform feature selection, in which case all non-selected features are given a weight of 0. Note that the sign of the correlation is interesting but not very important. We can also see some latent dimensions that seem correlated with respect to one metric but not the other. For example, there is a strong correlation between the samples score of MOFA factor 4 and sPLS component 2, but this is not reflected in their features weight. Again, this can happen because one method performs feature selection and the other does not. On the other hand, the correlation between sPLS component 3 and DIABLO component 2 is stronger when looking at their features weight than at their samples score.
When using the function to compare the results of the two MOFA runs, the names that we gave to the output list will be used as method names, and these are the names to use when selecting latent dimensions of interest. For example:
```{r comparison-heatmap-corr-mofa}
#| fig.width: 7
#| fig.height: 7
comparison_heatmap_corr(
output_list_mofa,
latent_dimensions = list(
"MOFA (supervised pref.)" = paste0("Factor ", 1:5),
"MOFA (unsupervised pref.)" = paste0("Factor ", 1:5)
),
legend_ncol = 1
)
```
## A note about missing features
Typically, when comparing the output of different integration methods, we assume that the same input data was used for each method. However, when we are instead comparing the impact of data pre-processing on the integration results, it is possible that not all features are present in the input data of each method. This is for example the case when assessing the impact of different prefiltering settings (e.g. supervised vs unsupervised) on the results of a particular integration method. It is also the case when comparing integration results obtained with different subsets of omics data as input: in our example, both DIABLO and MOFA were applied to the three omics datasets (genomics, transcriptomics and metabolomics), while sPLS and sO2PLS were run on the transcriptomics and metabolomics datasets only. In such cases, some features will have no weight in some of the results we are comparing (as they were not in the input data analysed by the integration method). There are two approaches to deal with this:
* Approach 1: remove these features from the comparison. The correlation between the features weight of any two latent dimensions will only be calculated using the features that are assigned a weight by both latent dimensions. This is the default approach used by `comparison_heatmap_corr()` and other comparison functions.
* Approach 2: the features that are assigned a weight by some methods but not others are retained for the comparison, and the missing weights are replaced with zero values.
The problem with approach 2 is that it might artificially decrease the computed correlation coefficients, which is why approach 1 is the default in this package. However, approach 2 may reveal interesting patterns and was therefore also implemented.
In `comparison_heatmap_corr()` and other comparison functions where relevant, the choice between these two approaches is controlled by the `include_missing_features` parameter. The first approach is selected with `include_missing_features = FALSE`, while the second approach is selected with `include_missing_features = TRUE`.
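For example, the heatmap above could be recreated with the second approach as follows (a minimal sketch, not evaluated here):
```{r comparison-heatmap-corr-missing-features}
#| eval: false
## Approach 2: retain features that are absent from some methods' input
## data, treating their missing weights as zeros
comparison_heatmap_corr(
  output_list,
  include_missing_features = TRUE
)
```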
Note that the first approach does not remove features that were discarded as a result of feature selection by one of the integration methods, only features that were not in the input data of the integration method. For example, sO2PLS performs feature selection for the joint components, and as a consequence the features that are not selected are assigned a weight of 0. These features are not affected by the choice of approach.
## Correlation matrices
We can obtain the correlation coefficients displayed in the heatmaps with the `get_features_weight_correlation()` and `get_samples_score_correlation()` functions:
```{r get-corr-mat}
get_samples_score_correlation(output_list)[1:5, 1:5]
get_features_weight_correlation(output_list)[1:5, 1:5]
```
Note that the formatting of the row and column names is designed for easier handling in the various plotting functions.
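Since these are plain named matrices, they can also be explored directly. As an illustration, here is a possible sketch that ranks pairs of latent dimensions by the strength of their samples score correlation; it only assumes that the function returns a square correlation matrix with informative dimension names, as shown above:
```{r top-correlated-pairs}
#| eval: false
## Turn the correlation matrix into a long data-frame and rank the
## pairs of latent dimensions by absolute correlation
corr_mat <- get_samples_score_correlation(output_list)
as.data.frame(as.table(corr_mat)) |>
  setNames(c("dimension_1", "dimension_2", "correlation")) |>
  ## keep each pair only once and drop self-correlations
  filter(as.character(dimension_1) < as.character(dimension_2)) |>
  arrange(desc(abs(correlation))) |>
  head()
```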
## Pairwise comparisons
While the `comparison_heatmap_corr()` function can handle any number of methods to compare (in our example, four), the resulting heatmap can become difficult to interpret. As an alternative, the `comparison_plot_correlation()` function visualises the correlation between the latent dimensions of any two methods. It also takes as input a list of methods' output, but the list must contain exactly two elements. For example, we will compare the results of sPLS and sO2PLS:
```{r comparison-plot-correlation}
#| fig.width: 10
#| fig.height: 10
comparison_plot_correlation(output_list[2:1])
```
As with `comparison_heatmap_corr()`, the function displays the correlation between the latent dimensions' samples score on the left, and between their features weight on the right, but using correlation plots rather than heatmaps. By default, only correlation coefficients above 0.2 have their value displayed (for clarity); this threshold can be customised through the `min_show_corr` argument.
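For instance, a minimal sketch that only displays coefficients above 0.5 on the same input:
```{r comparison-plot-correlation-threshold}
#| eval: false
comparison_plot_correlation(output_list[2:1], min_show_corr = 0.5)
```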
By default, the function plots both the samples score correlations and the features weight correlations, but it is possible to display only one of them by setting the `by` argument to either `"samples"` or `"features"`. In addition, as for the heatmap plotting function, it is possible to focus on certain latent dimensions by passing a named list of latent dimension names to the `latent_dimensions` argument:
```{r comparison-plot-correlation-subset}
#| fig.width: 10
#| fig.height: 6
comparison_plot_correlation(
output_list[2:1],
by = "samples",
latent_dimensions = list(
"sO2PLS" = c("joint component 1", "rnaseq specific component 1")
)
)
```
## Comparing samples score
The correlation plots shown above provide a useful summary of the similarity between the latent dimensions constructed by different integration methods. We can then investigate further how any two latent dimensions (from two different integration methods) are related. One way to do so is to compare their samples score in a scatterplot. In @sec-interpretation-samples-pair, the `plot_samples_score_pair()` function was used to display the samples score of two latent dimensions from the result of a single integration method, e.g.:
```{r plot-samples-score-pair-diablo}
#| fig.height: 7
tar_load(mo_set_complete)
plot_samples_score_pair(
diablo_output,
c("Component 1", "Component 2"),
mo_data = mo_set_complete,
colour_by = "status"
) +
scale_colour_brewer(palette = "Set1")
```
This function can also be used to compare the samples score of two latent dimensions from two different integration methods. This is done by passing to the function a list of length two containing the output of the two methods. The names of the latent dimensions to compare are provided as a named list, where each name corresponds to either the name of the method (if the input list is not named) or the name of the corresponding element in the input list. So for example, to compare the first latent dimension of MOFA and DIABLO:
```{r plot-samples-score-pair}
#| fig.height: 7
plot_samples_score_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
mo_data = mo_set_complete,
colour_by = "status"
) +
scale_colour_brewer(palette = "Set1")
```
We can see that the samples score of the two latent dimensions are strongly correlated, showing that they capture a similar trend in the data.
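If we want to put a number on this agreement, the correlation matrix functions from earlier can be reused. The sketch below is hedged in that it assumes the row and column names of the returned matrix end with the latent dimension labels:
```{r samples-score-agreement}
#| eval: false
## Extract the correlation between MOFA factor 1 and DIABLO component 1
## samples scores by matching on the dimension labels
sc <- get_samples_score_correlation(output_list[3:4])
sc[grepl("Factor 1$", rownames(sc)), grepl("Component 1$", colnames(sc))]
```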
## Comparing features weight
In a similar way, we can compare the importance scores that two latent dimensions from two different integration methods assign to the features. In @sec-interpretation-features-pair, the `plot_features_weight_pair()` function was used to display the features weight or importance score of two latent dimensions from the result of a single integration method, e.g.:
```{r plot-features-weight-pair-diablo}
plot_features_weight_pair(
diablo_output,
c("Component 1", "Component 2"),
mo_data = mo_set_complete,
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
```
This function can also be used to compare the features weight of two latent dimensions from two different integration methods, by passing a list of length two containing the output of the two methods. As before, the names of the latent dimensions to compare are provided as a named list, where each name corresponds to either the name of the method (if the input list is not named) or the name of the corresponding element in the input list. We will again compare MOFA factor 1 and DIABLO component 1:
```{r plot-features-weight-pair}
plot_features_weight_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
mo_data = mo_set_complete,
features_metric = "importance",
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
```
We can see that while MOFA and DIABLO identify the same metabolites as the most important for separating healthy and infected animals, the genomic markers and genes that are given the highest importance score by MOFA are not selected by DIABLO.
By default, the function plots the features' signed importance (i.e. their importance score with the sign of their weight applied), but in this case it makes sense to look at their (unsigned) importance score, which we have done by setting the `features_metric` argument to `'importance'`. Note that we can also compare the raw weights, by setting `features_metric` to `'weight'`. In that case, keep in mind that the weights from different latent dimensions live on different scales, so the raw values are not directly comparable (but the features' ranking and weight sign are). We can also focus on a subset of datasets by passing their names to the `datasets` argument:
```{r plot-features-weight-raw}
plot_features_weight_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
features_metric = "weight",
datasets = c("rnaseq", "metabolome"),
mo_data = mo_set_complete,
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
```
By default, the `plot_features_weight_pair()` function uses the geometric consensus importance metric (more details in the next section) to highlight the 5 features identified as most important by both methods. Both the number of features highlighted and the metric used can be controlled, through the `top_n` and `metric` arguments respectively, as shown in the sketch below. In the following section, we will expand on the concept of consensus importance and the different metrics available.
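For example, a minimal sketch highlighting the top 10 features instead, reusing the inputs from above:
```{r plot-features-weight-top10}
#| eval: false
plot_features_weight_pair(
  output_list[3:4],
  list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
  mo_data = mo_set_complete,
  top_n = 10,
  label_cols = list(
    "rnaseq" = "Name",
    "metabolome" = "name"
  )
)
```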
## Feature consensus importance {#sec-comparison-consensus}
### Possible metrics
When comparing the results of several integration methods, we might find that several methods uncover similar trends in the datasets, through the latent dimensions they construct. For these latent dimensions, it is interesting to assess which features are consistently selected by the different methods as driving these trends. We can do so by computing the consensus importance score of the features, which summarises the importance given to the features by the different integration methods. Different metrics can be used to compute the consensus importance score. With $s_j^i$ denoting the importance score of feature $j$ from method $i$, $i = 1 \dots n_i$ (more specifically, from one latent dimension constructed by the integration method), the following metrics are implemented in the package:
* Geometric mean (`geometric`): calculated as $CI_j^G = \exp\left(\frac{1}{n_i} \sum\limits_i \log(s_j^i)\right)$
* Harmonic mean (`harmonic`): calculated as $CI_j^H = \frac{1}{\frac{1}{n_i}\sum\limits_i\frac{1}{s_j^i}}$
* L2 norm (`l2`): calculated as $CI_j^{L2} = \sqrt{\sum\limits_i (s_j^i)^2}$
* Average (`average`): calculated as $CI_j^A = \frac{1}{n_i} \sum\limits_i s_j^i$
* Product (`product`): calculated as $CI_j^P = \prod\limits_i s_j^i$
* Minimum (`min`): calculated as $CI_j^{min} = \min\limits_i(s_j^i)$
* Maximum (`max`): calculated as $CI_j^{max} = \max\limits_i(s_j^i)$
The difference between these metrics can be visualised in the simple case where features are assigned two different importance scores, e.g. by two different integration methods:
```{r show-consensus-importance}
#| fig.height: 7
show_consensus_metrics()
```
In the plots, the consensus importance values have been normalised so that the highest value is 1. As we can see, metrics such as the geometric mean, harmonic mean, product or minimum give higher consensus scores to features that are consistently assigned a high importance score across all methods, while features with a high importance score in one method but a low score in the other get a lower consensus score. Conversely, metrics such as the L2 norm or maximum prioritise features that are given a high importance score by at least one method, regardless of their importance score with the other methods.
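To make the formulas concrete, here is a small stand-alone sketch computing each metric by hand for a feature that received importance scores of 0.9 and 0.3 from two methods (illustrative values only; the package's `compute_consensus_importance()` function, shown below, does this for all features):
```{r consensus-metrics-by-hand}
#| eval: false
## Importance scores for one feature from two methods (made-up values)
s <- c(0.9, 0.3)
c(
  geometric = exp(mean(log(s))),   # CI^G
  harmonic  = 1 / mean(1 / s),     # CI^H
  l2        = sqrt(sum(s^2)),      # CI^L2
  average   = mean(s),             # CI^A
  product   = prod(s),             # CI^P
  min       = min(s),              # CI^min
  max       = max(s)               # CI^max
)
```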
### Computing features' consensus importance score
The function `compute_consensus_importance()` takes as input a list of output objects from different integration methods, as well as a named list indicating which latent dimension from each output object should be considered. By default, the geometric mean is used, but this can be changed through the `metric` argument.
```{r compute-consensus-importance}
consensus_df <- compute_consensus_importance(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1")
)
consensus_df
```
To make it easier to compare this data-frame with the plots created in the previous section, we will add the features' labels to the consensus importance data-frame.
```{r adding-features-labels}
features_labels_df <- get_features_labels(
mo_set_complete,
list("rnaseq" = "Name",
"metabolome" = "name")
)
consensus_df <- consensus_df |>
left_join(features_labels_df, by = c("dataset", "feature_id"))
consensus_df |>
group_by(dataset) |>
slice_max(importance, n = 5)
```
We can see that, using the geometric mean, the feature from the metabolomics dataset with the highest consensus importance is citric acid. Note that the choice of metric matters less when the latent dimensions compared are in strong agreement, as is the case here:
```{r compute-consensus-importance-average}
compute_consensus_importance(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
metric = "average"
) |>
left_join(features_labels_df, by = c("dataset", "feature_id")) |>
group_by(dataset) |>
slice_max(importance, n = 5)
```
In the `plot_features_weight_pair()` function, we can specify which consensus importance metric should be used to highlight the top features. For example, we can highlight the top features according to the average consensus metric:
```{r plot-features-weight-average}
plot_features_weight_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
metric = "average",
datasets = c("rnaseq", "metabolome"),
mo_data = mo_set_complete,
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
```
## Recap -- targets list
Although we didn't create many new targets in this chapter, we can turn some of the plots into targets.
<details>
<summary>Targets list for comparing different integration methods' output</summary>
::: {.targets-chunk}
```{targets recap-targets-list}
list(
## Creating a list of integration methods output objects
tar_target(
output_list,
list(spls_output, so2pls_output, mofa_output, diablo_output)
),
## Heatmap for comparison of integration methods output
tar_target(
comparison_methods_heatmap_plot,
comparison_heatmap_corr(output_list)
),
## Correlation plot for comparison of sPLS and sO2PLS results
tar_target(
spls_so2pls_correlation_plot,
comparison_plot_correlation(output_list[2:1])
),
## Comparison of samples score for MOFA factor 1 and DIABLO component 1
tar_target(
mofa_diablo_samples_score_comparison_plot,
plot_samples_score_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
mo_data = mo_set_complete,
colour_by = "status"
) +
scale_colour_brewer(palette = "Set1")
),
## Comparison of features weight for MOFA factor 1 and DIABLO component 1
tar_target(
mofa_diablo_features_weight_comparison_plot,
plot_features_weight_pair(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1"),
mo_data = mo_set_complete,
label_cols = list(
"rnaseq" = "Name",
"metabolome" = "name"
)
)
),
## Table of features' consensus importance for MOFA factor 1 and DIABLO
## component 1
tar_target(
mofa_diablo_features_weight_consensus_importance,
compute_consensus_importance(
output_list[3:4],
list("MOFA" = "Factor 1", "DIABLO" = "Component 1")
) |>
left_join(
get_features_labels(
mo_set_complete,
list("rnaseq" = "Name",
"metabolome" = "name")
),
by = c("dataset", "feature_id")
)
)
)
```
:::
</details>