-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04_results.Rmd
1099 lines (847 loc) · 67.2 KB
/
04_results.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Results
## Packages/methods
### experDesign
`experDesign` package built in R was released for the first time on CRAN on 2020-09-08 after nearly a year after the initial release made on [github](https://github.com/llrs/experDesign).
After peer review it was published on a journal on 2021-11 [@revillasancho2021].
The package uses functional programming to create and modify objects and the features used.
The package bases its performance on the large body of work made by the R core team.
It adds the information to the introduced `data.frame` or returns an vector with the appropriate information.
`experDesign` functions are divided into several categories:
- Helper functions to aid on deciding how many batches are or how many samples per batch.
There are some also that report how good a given distribution of the samples felt for a given dataset.
- Functions generating indexes.
- Functions distributing the samples on indexes
```{r experdesign-flowchart, fig.scap="experDesign functions and workflow", fig.cap="`experDesign` functions and workflow. Workflow for users of the package showing which functions can be used depending on the experiment design they have."}
include_graphics("images/experDesign-flowchart.png")
```
Regarding time related variables `experDesign` will use them as factors, while issuing a warning to the user.
Since its development it has been used on a couple of RNA sequencing experiments that required a batch design, one of organoids bulk RNAseq (data not related to this thesis) and another one of biopsies bulk RNAseq from the \acr{BARCELONA} cohort (See appendix section \@ref(barcelona)).
It was also used to check if there is any observable batch effect on the datasets analyzed.
On the designed datasets `experDesign` avoided batch effects from the sequencing process.
However, on the organoids dataset, a change on the matrigel used to produce them introduced a batch effect that made it impossible to compare samples before and after that change (there were not any shared sample before and after the change of matrigel).
On the \acr{BARCELONA} cohort there were other problems described on the appendices (section \@ref(barcelona)).
```{r experDesign, include=FALSE, eval=FALSE}
library("dlstats")
library("dplyr")
cran_stats("experDesign") %>%
group_by(month = format(end, "%Y/%m")) %>%
summarise(d = sum(downloads)) %>%
ungroup() %>%
pull(d) %>%
median()
```
Since its release on CRAN it has had a median of \~400 downloads each month from RStudio repository mirror, showing the interest the community have on solutions like this.
\FloatBarrier
### BaseSet
`BaseSet` package, built in R, was released for the first time on CRAN on 2020-11-11, nearly two years after the initial work started on [github](https://github.com/llrs/BaseSet "Repository").
The package uses both functional programming and object oriented program to create and modify the TidySet S4 object defined[^results-1].
Mixing it with S3 generic functions it provides a powerful interface compatible with the tidyverse principles, a [group of packages](https://www.tidyverse.org/) following the same design.
The package provides a new class to handle fuzzy sets and the associate information.
[^results-1]: S4 is one of the object programming paradigms on R.
For a more complete overview and differences see Advanced R, freely available online: https://adv-r.hadley.nz/oo.html [@wickham2019].
`BaseSet` methods are divided into several categories:
- [General functions](https://docs.ropensci.org/BaseSet/reference/index.html#general) to create sets of the TidySet class or convert from it to a list or about the package.
- [Set operations](https://docs.ropensci.org/BaseSet/reference/index.html#set-operations) like adjacency cartesian product, cardinality, complement, incidence, independence, intersection, union, subtract, power set or size.
- [Functions to work with TidySets](https://docs.ropensci.org/BaseSet/reference/index.html#set-operations) to add relationships, sets, elements or some complimentary data about them.
Remove the same or simply move around data or calculate the number of elements, relations and sets.
- [Functions to read](https://docs.ropensci.org/BaseSet/reference/index.html#reading-files) files from formats where sets are usually stored in the bioinformatician field: GAF, GMT and OBO formats.
- Last, some [utility functions](https://docs.ropensci.org/BaseSet/reference/index.html#utiles) to use set name conventions and other auxiliary functions.
The package had a long development process with initial iterations based on GSEABase package which was later abandoned ([GSEAdv](https://github.com/llrs/GSEAdv)) to also include some uncertainty on the relationship of a gene with a given gene set.
The package formed part of an exploration of the Bioconductor community (project to develop, support, and disseminate free open source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assay) for more modern and faster handling of sets than [GSEABase](https://bioconductor.org/packages/GSEABase).
There were three different packages created as part of this process, `BaseSet`, `BiocSet` released [on Bioconductor](https://bioconductor.org/packages/BiocSet/) and `unisets`, available [on github](https://github.com/kevinrue/unisets).
The three different approaches were presented at a birds of feather on BioC2019, the annual conference of Bioconductor on 2019.
The package passed the review on the rOpenSci organization ([See review](https://github.com/ropensci/software-review/issues/359 "Review on Github")) and is now part of the packages hosted there too.
```{r BaseSet, include=FALSE, eval = FALSE}
library("dlstats")
library("dplyr")
cran_stats("BaseSet") %>%
group_by(month = format(end, "%Y/%m")) %>%
summarise(d = sum(downloads)) %>%
ungroup() %>%
pull(d) %>%
median()
```
Since its release on CRAN it has had a median close to \~400 downloads each month from RStudio package manager.
\FloatBarrier
### inteRmodel
The package was build once the method used to find accurate models of the relationships of the data available of a dataset using RGCCA was established.
Using the package [on github](https://github.com/llrs/intermodel "inteRmodel") simplifies the process and makes easier to redo the model optimization used on this thesis.
The package has functions that can be grouped in three categories:
- Look for models and evaluate them:
To search for a model given some conditions, such as that all the blocks are connected, and check the models via bootstrapping or leave one out procedures.
- Reporting: To make better reports by improving handling of names or simplifying the objects or how to calculate scores.
- Building: To easier build correct models on \acr{RGCCA}, simplifying the process to create a symmetric matrix.
Currently it is only [available on github](https:intermodel.llrs.dev), so the number of downloads and usage is unknown but since its release a user has contacted to keep it up to date with development versions of RGCCA.
Currently, it is compatible with the next release of [`RGCCA` being prepared](https://github.com/rgcca-factory/RGCCA/tree/CRAN)[^results-2].
[^results-2]: We also contributed with some comments and feedback to the package to make it easier to read the source and check the inputs and improve the documentation so that it is coherent with the code and previous results of the functions.
The functions `analyze` helps to analyze code of a single integration, providing the results on a tidy format for further processing.
To create the connections between blocks the function `weight_design` is available.
It creates all the possible matrices given a number of blocks and a number of weights.
Optionally it can create just a subset of those based on a numeric vector.
However, it does not provide a way to have the design named.\
If the user wants to create their own design matrices, they can use `symm` and modify the design of the model with `subSymm`.
`symm`, takes an initial matrix to pick up the row and column names.
It is recommended that the user checks the design matrix is fully connected, which the package facilitates with the function `correct`.
This is also recommended even if the design matrices are created with `weight_design`.
To search models `search_models` starts with a initial connectivity of the blocks and creates all the combinations of connections given.
For the bootstrap procedure there is the function `boot_index` to create the bootstrapped index of samples to be used by `boot_index_sgcca`.
`boot_index` randomly selects as much samples as specified by the arguments to create as much indices as the required by the second argument.
If the bootstrapped samples used is not important one can use `boot_samples_sgcca`.
If the users want to perform a leave-one-out procedure they can use `looIndex`.
```{r intermodel-workflows, fig.scap="inteRmodel functions and workflow.", fig.cap="`inteRmodel` functions and workflow. Functions provided by the inteRmodel package to help on the process, search and validate models of relationships using RGCCA. Created with BioRender.com"}
include_graphics("images/inteRmodel-workflows.png")
```
For more information, you can access the [manual online](https://llrs.github.io/inteRmodel/) or once it is installed.
The package cannot choose which variables use from the block with information to split into several blocks.
However, it provides the `model_RGCCA` function to make it easy to prepare such variables for RGCCA input.
The inteRmodel procedure will only be useful if the user should decide which variables are independent from others and split them into different blocks.
To asses this the user can use the methods we used, as described on the above section \@ref(variance).
It is important to keep in mind the possible causal relationships on user's data [@greenland2002].
## Analysis
On the following sections the main results of analyzing each dataset are presented.
(ref:poins-col-shape) Each point represents a sample (colored and shaped by location).
(ref:poins-col-disease) Each point represents a sample (colored by location and shape by disease).
(ref:points-boot) Each point represents a bootstrapped sample (colored by model used).
### Puget's dataset {#results-puget}
On this dataset the different parameters and capabilities of \acr{RGCCA} were tested.
The three different methods, centroid, factorial or horst were tested and compared.
The main result of this comparison was that the differences of the selection of the variables mattered more than the number of variables selected with each method.
The models were tested with different weights on all three schemes: horst, centroid and factorial.
The horst and the centroid scheme were similar while the factorial resulted in the most different \acr{AVE} values (see [S1 Data](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0246367#pone.0246367.s001) of [@revilla2021]).
The centroid scheme was selected because it takes into account all the relationship regardless of the sign of the canonical correlation between the blocks.
It is similarity to horst scheme.
The effect of the sparsity value was measured by its effect on the inner \acr{AVE} scores and the combination of the different values for each block as can be seen in figure \@ref(fig:puget-tau2AVE).
```{r puget-tau2AVE, fig.cap="Effect of tau on the inner AVE in the Puget dataset. The suggested tau value is the column between the regular grid, on the ordinate axis the y's tau values and on the abscissa the gene expression (GE) on the left and the comparative genomic hybridization (CGH) on the right. The highest inner AVE is with high tau values for y and middle to upper values for GE and CGH.", fig.scap="Effect of tau on the inner AVE in the Puget dataset."}
knitr::include_graphics("images/pugets_tau_to_AVE.png")
```
The first model of the family of models 1 can be seen in table \@ref(tab:puget-model1).
The \acr{GE} is related to the metadata available.
Similarly, \acr{CGH} is also related to the location data available.
All other relationships are not included on the model.
```{r puget-model1}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 1" = c("GE", "CGH", "metadata"),
GE = c(0L, 0L, 1L),
CGH = c(0L, 0L, 1L),
metadata = c(1L, 1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)), digits = 2,
caption.short = "Model 1 for the Puget dataset.",
caption = "Model 1 for Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship. The matrix represent relations between the gene expression and CGH with the metadata.")
format_design(k, df)
```
When looking for the model that adjust better following this structure we arrived to model 1.2, described in (Table \@ref(tab:puget-model1-2)).
In this model the \acr{CGH} relationship with the location variables is much weaker than \acr{GE} with them.
```{r puget-model1-2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 1.2" = c("GE", "CGH", "metadata"),
GE = c(0L, 0L, 1L),
CGH = c(0, 0, 0.1),
metadata = c(1, 0.1, 0),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)), digits = 2,
caption.short = "Model 1.2 for the Puget dataset.",
caption = "Model 1.2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship. CGH has a weaker relationship with the metadata than GE. ")
format_design(k, df)
```
On model 2 we split the metadata variables into invariable variables and those related to the location (Table \@ref(tab:puget-model2)).
We relate the \acr{GE} and \acr{CGH} to these blocks and between them.
```{r puget-model2}
df <- data.frame(
stringsAsFactors = FALSE,
'Model 2' = c("GE", "CGH", "Invariable", "Location"),
GE = c(0L, 1L, 1L, 1L),
CGH = c(1L, 0L, 1L, 1L),
Invariable = c(1L, 1L, 0L, 0L),
Location = c(1L, 1L, 0L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2 for the Puget dataset.",
caption = "Model 2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df)
```
Following this split, the model that has higher inner \acr{AVE} for these blocks is the following (Table \@ref(tab:puget-model2-2)).
```{r puget-model2-2}
df <- data.frame(
stringsAsFactors = FALSE,
'Model 2.2' = c("GE", "CGH", "Invariable", "Location"),
GE = c("0", "1/3", "0", "1"),
CGH = c("1/3", "0", "1/3", "0"),
Invariable = c("0", "1/3", "0", "0"),
Location = c(1L, 0L, 0L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2.2 the for the Puget dataset.",
caption = "Model 2.2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df)
```
If we added a superblock with all the data of the different blocks from model 1 we started with the standard relationship between blocks (Table \@ref(tab:puget-model-superblock)).
```{r puget-model-superblock}
df <- data.frame(
stringsAsFactors = FALSE,
'Model superblock' = c("GE", "CGH", "Superblock", "metadata"),
GE = c(0L, 0L, 1L, 0L),
CGH = c(0L, 0L, 1L, 0L),
Superblock = c(1L, 1L, 0L, 1L),
metadata = c(0L, 0L, 1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model with superblock for the Puget dataset.",
caption = "Model with superblock for the Puget dataset. Relationships between the different blocks in the Puget dataset for model superblock. 0 indicates no relationship and 1 indicates a strong relationship. The superblock is related to all other blocks but they are only related to the superblock.")
format_design(k, df)
```
But when the best model with the superblock that had highest inner \acr{AVE} is quite different (Table \@ref(tab:puget-model-superblock-2)).
It does not have a relationship between the superblock and the \acr{GE} and the metadata is related to \acr{GE} and \acr{CGH} too.
```{r puget-model-superblock-2}
df <- data.frame(
stringsAsFactors = FALSE,
'Model superblock.2' = c("GE", "CGH", "Superblock", "metadata"),
GE = c("1", "1/3", "0", "1"),
CGH = c("1/3", "0", "1", "0"),
Superblock = c(0L, 1L, 0L, 1L),
metadata = c(1L, 0L, 0L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model with superblock.2 for the Puget dataset.",
caption = "Model with superblock.2 for the Puget dataset. Relationships between the different blocks in the Puget dataset for model superblock.2. 0 indicates no relationship and 1 indicates a strong relationship. In this optimized design the GE and the metadata is related to GE and CGH too. The relationship between GE and CGH is weaker than other relations.")
format_design(k, df)
```
Exploratory analysis with the superblock model was done.
The first two components of the superblock did not help to explain the biology or classify the tumors (See figure \@ref(fig:puget-superblock)).
```{r puget-superblock, fig.scap="Superblock components of the Puget dataset.", fig.cap="Superblock components of the Puget dataset. First components of the model superblock.2 which has all the data of the samples on the Puget's dataset. Samples colored and shaped acording to the localization of the tumor. There is a mix of samples of all locations and then some far from it."}
knitr::include_graphics("images/pugets_superblock.png")
```
The same data was used to look for a good model from the data itself including a model with a superblock but looking at the first component of the \acr{CGH} and transcriptome block.
This allowed to visually inspect if each model's components helped to classify the samples (Figure \@ref(fig:puget-models)).
```{r puget-models, fig.scap="Different RGCCA models in the Puget dataset.", fig.cap="Different RGCCA models in the Puget dataset. The different models with the same data showing the sample position on the first components of the CGH and the transcriptome (or GE) block. Model 1 and 1.2 with transcriptomics, CGH data and all the data about the samples together. Model 2 and 2.2 with transcriptomics, CGH data and all the data about the samples on different blocks. Model superblock and superblock.2 have all the data from different blokcs and one block with all the data. All of them separate the samples according according to the transcriptome component by the localization of the tumors."}
knitr::include_graphics("images/pugets_models.png")
```
The first components of the \acr{CGH} and the transcriptomics blocks of the superblock.2 model show better classification than that of the superblock.
However, the other models show a better classification of the samples with much simpler models.
<!--# Slightly modified from the article!! -->
To find these models the three blocks with the best tau and the centroid scheme were analyzed by changing the weights between 0 and 1 by 0.1 intervals.
According to the inner \acr{AVE}, the best model was the one in which the weights (1) between the host transcriptome and location, (2) the host transcriptome and the \acr{CGH}, and (3) the \acr{CGH} block were linked to variables related to the location with weights of 1, 0.1 and 0.1, respectively.
When we added a superblock to the data, there was a slight increase of 0.01 on the inner \acr{AVE} of the model (See Table \@ref(tab:puget-models-ave)).
The model with the superblock that explained most of the variance was that in which the weights of the interaction within (1) the host transcriptome, (2) between the superblock and the \acr{CGH}, (3) between the host transcriptome and the localization, and (4) between acr{CGH} and the host transcriptome were 1, 1, 1 and 1/3, respectively (See table \@ref(tab:puget-model-superblock-2)).
To see if the superblock could classify the sample by location, we plotted the first two components of the superblock.
We can clearly see that they do not classify the samples according to the location of the tumor, which is known to affect the tumor phenotype [@puget2012].
Adding one block containing the age of the patient and the severity of the tumor to the model, decreased the inner \acr{AVE}.
The best model with these blocks, according to the inner \acr{AVE}, was that in which the interactions (1) within the host transcriptome, (2) between the host transcriptome and the localization, (3) between the host transcriptome and(4) the \acr{CGH} and between the \acr{CGH} and the other variables were 1, 1, 1/3 and 1/3, respectively.
The first components of each model can be seen in the figure:
We can observe on the figure \@ref(fig:puget-models), the strong dependency between gene expression and location since the first model while the weaker relationship with the \acr{CGH} assay [@puget2012].
On the other hand, the major difference is the dispersion on the \acr{CGH} component on each model.
We explored the effect of the superblock and weights on different models to the inner \acr{AVE}.
There are significant differences between having the superblock and not having it.
There is also some significant increase of the inner \acr{AVE} when adding a superblock.
```{r puget-weights, fig.scap="Effects of superblock and weights on the inner AVE on Puget's dataset.", fig.cap="Effects of superblock and weights on the inner AVE on Puget's dataset. Designs with the superblock showed higher inner AVE scores than without it. Interaction yes/no indicates RNA and RNA interaction. Higher inner AVE values are associated with relationships within the GE (p-values shown on top of the violin plots)."}
knitr::include_graphics("images/pugets-weights.png")
```
The different models resulted on the following \acr{AVE} values as reported in table \@ref(tab:puget-models-ave).
```{r puget-models-ave}
df <- data.frame(
stringsAsFactors = FALSE,
Model = c("1", "1.2", "2", "2.2", "superblock", "superblock.2"),
"inner AVE" = c(0.6333592,0.851236,0.2791546,
0.6902329,0.7055847,0.8047477),
"outer AVE" = c(0.06920968,0.06923187,
0.07386952,0.06927069,0.07345775,0.06958205),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "AVE values of RGCCA models in the Puget dataset.",
caption = "AVE values of RGCCA models in the Puget dataset. Values for both inner and outer AVE of the first canonical component of models 1, 1.2, 2 2.2 and superblock and superblock.2 are shown. The model with the highest inner AVE is the model 1.2 and superblock.2 of those with superblock.")
format_design(k, df)
```
The model 1.2, is the one with the highest inner AVE, followed by model superblock.2.
All of them classify the samples according to the type of tumor by the first component of the transcriptome.
\FloatBarrier
### HSCT dataset {#results-hsct}
The \acr{PERMANOVA} analysis was performed on this dataset to estimate which were the variables that are more relevant.
From the many variables the location, sex, patient id and others were found to be related to the variability of the microbiome or the transcriptome on this dataset.
<!--# From gourmiting 2 2018-03-13 -->
With the \acr{PERMANOVA} analysis we found that more of the 50% of the variance of normalized RNAseq data and microbiome data respectively is explained by the variables of location, disease, sex, and the interaction between disease and sex.
On the transcriptome the most important factor is location which is more than 15% of the variance, while on the microbiome data the most important factor is the patient id followed by location of the sample.
<!-- # Data on TRIM/intestinal_16S_RNAseq_metadb/models.RData and on TRIM/intestinal_16S_RNAseq_metadb/variance.R-->
```{r hsct-permanova-RNAseq}
df <- data.frame(Factor = c("Location", "IBD", "Sex", "IBD:Sex", "Location:IBD", "Location:Sex", "Location:IBD:Sex", "Residuals"),
R = c(0.18057, 0.03120, 0.01306, 0.01279, 0.02427, 0.00816, 0.00486, 0.72508),
"p-value" = c(.0002, .0002, .0012, 0.11798, 0.11458, 0.03519, 0.52190, NA),
check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "PERMANOVA analysis of the transcriptome in the HSCT dataset.",
caption = "PERMANOVA analysis of the transcriptome in the HSCT dataset. The variables and their interactions (shown with :) and the $R^2$ values and the associated p-value. The higher the $R^2$ the more variance is explained by that factor. Location and IBD are the biggest variables affecting the transcriptome, but overall they explain a small part of all the variability.")
format_design(k, df)
```
```{r hsct-permanova-microbiome}
df <- data.frame(Factor = c("Location", "IBD", "Sex", "IBD:Sex", "Location:IBD", "Location:Sex", "Location:IBD:Sex", "Residuals"),
R = c(0.06061, 0.04967, 0.01712, 0.01091, 0.02089, 0.01139, 0.00289, 0.82652),
"p-value" = c(.0002000, .0002000, .0003999, 0.6604679, 0.8476305, 0.00759185, 0.9994001, NA),
check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "PERMANOVA analysis of the microbiome in the HSCT dataset.",
caption = "PERMANOVA analysis of the microbiome in the HSCT dataset. The variables and their interactions (shown with :) and the $R^2$ values and the associated p-value. The higher the $R^2$ the more variance is explained by that factor. Location and IBD are the biggest variables affecting the microbiome, but overall they explain a small part of all the variability (smaller than with the gene expression).")
format_design(k, df)
```
Both tables (\@ref(tab:hsct-permanova-RNAseq) and \@ref(tab:hsct-permanova-microbiome)) show that location and IBD is a variable that affects the \acr{GE} and microbiome, but there are some interactions that explain similar percentage of variance.
<!--# global test from the comments on TRIM/intestinal_16S_RNAseq_metadb/variance.R -->
With globaltest the results were similar: sex, \acr{IBD}, location, age and time since diagnosis were able to explain the simple endoscopic score for Crohn's disease (p-value $5.7 \cdot 10^{-21}$ ).
The resulting p-value was well below the 0,05 threshold defined for RNAseq data on the models including the segment of the sample, sex and treatment.
On the microbiome data the results were similar but the p-value was considerably higher but still below the threshold.
```{r hsct-diversity, fig.scap="Microbiome diversity in the HSCT dataset.", fig.cap="Microbiome diversity in the HSCT dataset. On the upper section the Shannon effective and on the lower row the Simpson effective diversity splitted by colon and ileum and controls and CD. There is high diversity between the samples on the same condition."}
knitr::include_graphics("images/hsct-ASV-diversity.png")
```
Diversity indices of the samples were explored and compared for several subsets.
Splitting by location of the sample and disease provided the highest differences and the diversity index along time did not change much.
The \acr{PCA} didn't show any pattern on the microbiome according to the location, as can be seen in the first dimensions of the \acr{PCA} in figure \@ref(fig:hsct-pca-16s).
```{r hsct-pca-16s, fig.scap="PCA of 16S data from the HSCT dataset.", fig.cap = "PCA of the 16S data from the HSCT dataset. There are no clear patterns according to the location. (ref:poins-col-shape)"}
include_graphics("images/HSCT_PCA_16S.png")
```
The \acr{PCA} on the transcriptome shows a clear pattern splitting by location of the sample on the first dimension (see figure \@ref(fig:hsct-pca-RNAseq)).
```{r hsct-pca-RNAseq, fig.scap="PCA of RNAseq data from the HSCT dataset.", fig.cap = "PCA of the RNAseq data from the HSCT dataset. The samples separate according to the location. (ref:poins-col-shape)"}
include_graphics("images/HSCT_PCA_RNAseq.png")
```
Weighted gene co-expression network analysis did not provide relevant links between bacteria and transcriptome as it failed to find an acceptable scale free degree.
As can be seen in the figure \@ref(fig:hsct-wgcna-power), the scale free topology does not reach the recommended threshold of 0.9 and the mean connectivity is also very low even for the first power.
```{r hsct-wgcna-power, fig.scap="Power evaluation of WGCNA in the HSCT dataset.", fig.cap="Power evaluation of WGCNA in the HSCT dataset. On the ordinate the power on the abscissa on the left the scale free topology model fit; on the right the mean connectivity. There is a low fit even on large power and the mean connectivity is below 100 from the very first value."}
include_graphics("images/hsct-wgcna-power.png")
```
STATegRa was used between stool \acr{16S} data and intestinal \acr{16S} data under the assumption that there is a shared common factor without influence of other categorical variables.
However, it did not find a good agreement between these two data sources and \acr{16S} data source was not longer used on the analysis.
In addition, the model is fixed, so it did not allow to find new or other relationships that are not one to one.
With \acr{RGCCA} we could select different models and use all the data available without much assumptions.
The models with the highest inner \acr{AVE} of the family 1 and the family 2 models were similar to those on the Häsler dataset.
```{r hsct-models, fig.scap="Models in the HSCT dataset.", fig.cap="Models in the HSCT dataset. On the abscissa the transcriptome, on the ordinate the Microbiome. Each square represents a different model of the HSCT dataset. On panel A colored by disease status, on panel B colored by sample location. Model 0 has only transcriptome and microbiome data, models 1 to 1.2 with data about the samples and models 2.1 to 2.3 with data about the samples split in 3 blocks."}
include_graphics("images/hsct-models.png")
```
The weights of these models can be observed here in table \@ref(tab:hsct-model0).
```{r hsct-model0}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 0" = c("Transcriptome", "Microbiome"),
Transcriptome = c(0L, 1L),
Microbiome = c(1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 0 of the HSCT dataset.",
caption = "Relationships between the different blocks in the HSCT dataset for model 0. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df)
```
If we include the information about the samples all together in a block called metadata we can start from this model in table \@ref(tab:hsct-model1-1).
```{r hsct-model1-1}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 1.1" = c("Transcriptome", "Microbiome", "metadata"),
Transcriptome = c(0L, 0L, 1L),
Microbiome = c(0L, 0L, 1L),
metadata = c(1L, 1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 1.1 the of HSCT dataset.",
caption = "elationships between the different blocks in the HSCT dataset for model 1.1. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df)
```
When looking for the model that adjust better following this blocks we arrived to model 1.2 thanks to `inteRmodel`, described in table \@ref(tab:hsct-model1-2).
```{r hsct-model1-2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 1.2" = c("Transcriptome", "Microbiome", "metadata"),
Transcriptome = c(0L, 0L, 1L),
Microbiome = c(0, 0, 0.1),
metadata = c(1, 0.1, 0),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 1.2 of the HSCT dataset.",
caption = "Relationships between the different blocks in the HSCT dataset for model 1.2. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df)
```
On model two we split the invariable variables from those related to the location:
```{r hsct-model2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 2" = c("Transcriptome", "Microbiome","Demographic","Location", "Time"),
Transcriptome = c(0L, 1L, 1L, 1L, 0L),
Microbiome = c(1L, 0L, 1L, 1L, 0L),
Demographic = c(1L, 1L, 0L, 0L, 1L),
Location = c(1L, 1L, 0L, 0L, 0L),
Time = c(0L, 0L, 1L, 0L, 0L),
check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2 of the HSCT dataset.",
caption = "Relationships between the different blocks in the HSCT dataset for model 2. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df, out = "scale_down")
```
Following this split, we used `inteRmodel` (See section above \@ref(intermodel)) to find the model that has higher inner \acr{AVE} for these blocks is the one in table \@ref(tab:hsct-model2-2).
```{r hsct-model2-2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 2.2" = c("Transcriptome",
"Microbiome","Demographic","Location","Time"),
Transcriptome = c(0L, 0L, 0L, 1L, 0L),
Microbiome = c(0, 0, 0.2, 0.1, 0),
Demographic = c(0, 0.2, 0, 0, 0.6),
Location = c(1, 0.1, 0, 0, 0),
Time = c(0, 0, 0.6, 0, 0),
check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2.2 of the HSCT dataset.",
caption = "Relationships between the different blocks in the HSCT dataset for model 2.2. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df, out = "scale_down")
```
We also tested specifically a model from the family 2.3, which can be seen in table \@ref(tab:hsct-model2-3).
```{r hsct-model2-3}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 2.3" = c("Transcriptome",
"Microbiome","Demographic","Location","Time"),
Transcriptome = c(0, 0.1, 0, 1, 0),
Microbiome = c(0.1, 0, 0.1, 0.1, 0),
Demographic = c(0, 0.1, 0, 0, 1),
Location = c(1, 0.1, 0, 0, 0),
Time = c(0L, 0L, 1L, 0L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2.3 of the HSCT dataset.",
caption = "Relationships between the different blocks in the HSCT dataset for model 2.3. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df, out = "scale_down")
```
The best model of the family 2 confirmed a relationship between the host transcriptome and the location-related variables, while the microbiome was associated with the demographic and location-related variables (see figure \@ref(fig:hsct-models) and [S2](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0246367#pone.0246367.s002) data of [@revilla2021]).
Overall, we see that the relationships in the model affected the distribution of samples on the components of both the host transcriptome and the microbiome.
```{r hsct-models-ave}
df <- data.frame(
Model = c(0, 1, 1.1, 1.2, 2, 2.1, 2.2, 2.3),
"inner AVE" = c(0.3999234,0.623019,0.5678189,
0.7043881,0.2517363,0.6940253,0.818764,0.7761846),
"outer AVE" = c(0.10016886,0.08423331,
0.08487135,0.07757663,0.09820497,0.09402657,0.09416275,
0.09439379),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "The models in the HSCT dataset and their AVE values.",
caption = "The models in the HSCT and their AVE values. For each model the inner AVE and the outer AVE is presented.")
format_design(k, df)
```
The different models selected different variables, some of which are shared between models.
The most similar models are those that have split the metadata into 3 blocks, followed by those that have the metadata in a single block.
In order to analyze the accuracy of the models, one thousand bootstraps were used to integrate the data from the \acr{HSCT} \acr{CD} dataset.
Each bootstrap had its own dispersion on the variables according to the samples selected, the distribution of the bootstraps used are represented here:
```{r hsct-bootstrap-quality, fig.scap="Characteristics of the samples from the HSCT bootstraped samples.", fig.cap="Characteristics of the samples from the HSCT bootstraped samples. Dispersion of the bootstraped samples on age and percentage of colon and controls samples. The bootstraps represent a wide range of different composition."}
include_graphics("images/hsct-bootstrap-characteristics.png")
```
Evaluating the same model on each bootstrap lead to a dispersion on the inner \acr{AVE} of the model.
The lower the dispersion, the more robust the model was to different conditions than in the initial testing.
```{r hsct-bootstrap, fig.scap="Bootstrap results from the HSCT dataset.", fig.cap="Bootstrap results from the HSCT dataset. Bootstrap of the models 0, 1.2 and 2.2 in the HSCT dataset. Model 0 and 1.2 have lower inner and outer AVE score, model 2.2 has lower outer score but higher inner value than the bootstrapped. (ref:points-boot) The dispersion is shown by the ellipses. The point with the black circle is the AVE of the original data."}
include_graphics("images/hsct_bootstrap_models.png")
```
With the bootstrapped models we used `BaseSet` to estimate the probability that each variable to be relevant for the association with a disease.
However, due to big amount of small probabilities when using the `BaseSet` package to calculate which variables are more relevant it could not provide a good estimation on time.
\acr{MCIA} was applied as a baseline of the integration, the first two components were represented similarly to those of the blocks when using \acr{RGCCA}, see figure \@ref(fig:hsct-mcia) of its outcome.
```{r hsct-mcia, fig.scap="MCIA dimensions in the HSCT dataset.", fig.cap="MCIA dimensions in the HSCT dataset. MCIA first two synthetic variables. The first variable separates the samples in colon and ileum. (ref:poins-col-disease)"}
include_graphics("images/hsct-mcia.png")
```
The \acr{AUC} of classifying the transcriptome in colon or ileum segments was compared between the models (see figure \@ref(fig:hsct-auc-plot) and with \acr{MCIA}).
```{r hsct-auc-plot, fig.scap="AUC of the RGCCA models in the HSCT dataset.", fig.cap="AUC of the RGCCA models in the HSCT dataset. The classification of the localization of the sample according to the first component of the gene expression of the models generated with RGCCA on the HSCT dataset. Models of family 2 classify much better the samples than models of family 1 or model 0."}
include_graphics("images/hsct-auc-models.png")
```
These models have the following \acr{AUC} to classify the location of the sample according to the first component of the transcriptome block.
```{r hsct-auc}
df <- data.frame(Model = c("0", "1", "1.1", "1.2", "2", "2.1", "2.2", "2.3"),
AUC = c(0.453703703703704, 0.430941358024691, 0.463927469135802, 0.595871913580247,
0.945023148148148, 0.998842592592593, 0.998070987654321, 0.996913580246914
))
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "AUC values of RGCCA models in the HSCT dataset.",
caption = "AUC values of RGCCA models of the HSCT dataset classyfing the location of the sample according to the first component of the transcriptome block. From model 0 to model 2.3, the best classification is achieved with model 2.1. Note: this is removing two samples for which the location is unkown.")
format_design(k, df)
```
On \acr{MCIA} the \acr{AUC} for the classification of ileum or colon samples is of 0.9851 once those two samples with unknown location are excluded.
This is on par with the models of family 2 as can be seen in the table \@ref(tab:hsct-auc).
The different models selected different variables as can be seen in figure \@ref(fig:hsct-features).
```{r hsct-features, fig.scap="Upset plot of variables selected from the HSCT dataset.", fig.cap="Upset plot of variables selected in the HSCT dataset. The variables selected on each model from 0 to 2.3 showing the intersection between them regarding genes, panel A, and OTUs, panel B. Genes are common on models of family 2, while OTUs are common on all models."}
include_graphics("images/hsct-features.png")
```
Differences and similarities between the selected features of each model can be observed in figure \@ref(fig:hsct-features).
Genes are very similar between model 0 to 1.2 and between 2 to 2.3, while \acr{OTUs} are very unique on model 0 and others shared between most models.
\FloatBarrier
### Häsler's dataset {#results-hasler}
If we look at the dataset 16S, the \acr{PCA} does not show a pattern as can be seen in figure \@ref(fig:hasler-pca-16s).
```{r hasler-pca-16s, fig.scap="PCA of 16S of the Häsler dataset.", fig.cap = "PCA of 16S of Häsler dataset. There are some samples that are very different from the others. (ref:poins-col-shape)"}
include_graphics("images/hasler_PCA_16S.png")
```
The \acr{PCA} on the transcriptome shows a distinction between colon and ileum according to the first dimension of the \acr{PCA} that can be seen in figure \@ref(fig:hasler-pca-RNAseq).
```{r hasler-pca-RNAseq, fig.scap="PCA of RNAseq of the Häsler dataset.", fig.cap = "PCA of RNAseq of the Häsler dataset. There are two clear groups according to the location on the first component. (ref:poins-col-shape)"}
include_graphics("images/hasler_PCA_RNAseq.png")
```
In this dataset, the parameter tau behaved slightly differently than with the previous dataset but the value from the Schäfer's method for tau was close to the best value.
```{r hasler-tau, fig.scap="Tau effect on the inner AVE in the Häsler dataset.", fig.cap="Tau effect on the inner AVE in the Häsler dataset. Changes on tau on the centroid scheme in the Häsler dataset affect the inner AVE score on the model 1. The panel A shows on the ordinate the RNAseq tau value, the panel B on the right, shows the tau of the microorganism; both of them show the y's tau on the abscissa."}
include_graphics("images/hasler-centroid_tau.png")
```
Models for Häsler dataset are the following:
```{r hasler-model-0}
m0 <- data.frame("Model 0" = c("RNAseq", "micro"),
RNAseq = c(0, 1),
micro = c(1, 0),
check.names = FALSE)
k <- kable(m0, align = rep("c", ncol(m0)),
caption.short = "Model 0 of the H\u00E4sler dataset.",
caption = "Model 0 of the H\u00E4sler dataset. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, m0)
```
The first model for family 1 is in table \@ref(tab:hasler-model-1-1).
```{r hasler-model-1-1}
df <- data.frame(
"Model 1.1" = c("RNAseq", "micro", "meta"),
RNAseq = c(0, 0, 1),
micro = c(0, 0, 1),
meta = c(1, 1, 0),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 1.1 of the H\u00E4sler dataset.",
caption = "Model 1.1 of the H\u00E4sler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNASeq and microbiota (micro) are related to the metavariables (meta).")
format_design(k, df)
```
The after optimization of the model of family 1, the best one is in table \@ref(tab:hasler-model-1-2).
```{r hasler-model-1-2}
df <- data.frame("Model 1.2" = c("RNAseq", "micro", "meta"),
RNAseq = c(0, 0, 1),
micro = c(0, 0, 0.1),
meta = c(1, 0.1, 0),
check.names = FALSE)
k <- kable(df,
align = rep("c", ncol(df)),
caption.short = "Model 1.2 of the H\u00E4sler dataset.",
caption = "Model 1.2 of the H\u00E4sler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNASeq and microbiota (micro) are related to the metavariables (meta), the former much more than the microbiota.")
format_design(k, df)
```
The first model for family 2 is in table \@ref(tab:hasler-model-2-1).
```{r hasler-model-2-1}
df <- data.frame(list(
"Model 2.1" = c("RNAseq", "micro", "Location", "Demographic", "Time")),
RNAseq = c(0, 0, 1, 0, 0),
micro = c(0, 0, 0.5, 1, 0),
Location = c(1, 0.5, 0, 0, 0),
Demographic = c(0, 1, 0, 0, 1),
Time = c(0, 0, 0, 1, 0),
check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2.1 of the H\u00E4sler dataset.",
caption = "Model 2.1 of the H\u00E4sler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNAseq is related to the location, microbiota to location and demographic variables and demographic variables are also related to time.")
format_design(k, df, out = "scale_down")
```
After optimization of models of family 1 the best model according to the inner \acr{AVE} score is in table \@ref(tab:hasler-model-2-2).
```{r hasler-model-2-2}
df <- data.frame(list("Model 2.2" = c("RNAseq", "micro", "Location", "Demographic", "Time"),
RNAseq = c(0, 0.1, 1, 0, 0),
micro = c(0.1, 0, 0, 0.1, 1),
Location = c(1, 0, 0, 0, 0),
Demographic = c(0, 0.1, 0, 0, 0.1),
Time = c(0, 1, 0, 0.1, 0)), check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2.2 of the H\u00E4sler dataset.",
caption = "Model 2.2 of the H\u00E4sler dataset. 0 indicates no relationship and 1 indicates a strong relationship. RNAseq is related to the microbiota and location (much stronger), the microbiota to the demographic and time (stronger) and demographic and time are linked too.")
format_design(k, df, out = "scale_down")
```
In table \@ref(tab:hasler-aves), we can see here the AVE scores of each of the previous models.
```{r hasler-aves}
df <- data.frame(Model = c("0", "1.1", "1.2", "2.1", "2.2"),
"inner AVE" = c(0.8217371, 0.7461423, 0.8349410, 0.4980681, 0.7513065),
"outer AVE" = c(0.0961236, 0.1024148, 0.1025486, 0.1008395, 0.1009915),
check.names = FALSE)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "AVE values of RGCCA models in the H\u00E4sler dataset.",
caption = "AVE values of RGCCA models in the H\u00E4sler dataset. The inner and the outer AVE scores of multiple models tested on the H\u00E4sler dataset are shown. The model with the hightest inner AVE is model 1.2 followed by model 2.2.")
format_design(k, df)
```
In contrast to the \acr{HSCT}'s dataset (table \@ref(tab:hsct-models-ave)), the model with the highest inner \acr{AVE} was model 1.2 but model 2.2 was close to it (see table \@ref(tab:hasler-aves)).
Model 2.2 has a relationship of 0.1 between microbiome and the host transcriptome and of 1 between the location and the host transcriptome.
The microbiome block is also related by a factor of 0.1 with the demographic block and of 1 with the time block.
Lastly, the time and the demographic block are related by a factor of a 0.1.
In either case the family 1 and family 2 models can correctly separate by sample location (colon or ileum) but not by disease type or inflammation status as can be seen in figure \@ref(fig:hasler-models).
```{r hasler-models, fig.scap="Models from inteRmodel in the Häsler's datset.", fig.cap="Models from inteRmodel in the Häsler's dataset. Model 0 with just the transcriptome and microbiome data. Models 1.1 to 1.2 with transcriptome, microbiome and sample data in a single block. Models 2.1 and 2.2 with transcriptome, microbiome and sample data in multiple blocks. On the A panel colored by disease on the B panel colored by location of the sample."}
include_graphics("images/hasler-models.png")
```
There is no observable cluster of \acr{IBD} samples and the other samples, showing that on this dataset the differences of the microbiome between the different type of samples are less stark.
The classification of samples was very accurate in all the models, specially on model 2.2, see figure \@ref(fig:hasler-auc-plot).
```{r hasler-auc-plot, fig.scap="AUC of the RGCCA models in the Häsler dataset", fig.cap="AUC of the RGCCA models in the Häsler dataset. The classification of the localization of the sample according to the first component of the gene expression of the models generated with RGCCA on the Häsler's dataset."}
include_graphics("images/hasler-auc.png")
```
This accuracy resulted on high \acr{AUC} values for all the models, as can be seen in table \@ref(tab:hasler-auc).
(ref:hasler-auc-s) AUC of the RGCCA models in Häsler's dataset.
(ref:hasler-auc) The AUC was calculated with the first dimension of the gene expression block ability to predict location of the sample. Models of family 2 reach 100% classification.
```{r hasler-auc}
df <- data.frame(
Model = c("0", "1.1", "1.2", "2.1", "2.2"),
AUC = c(0.801149425287356, 0.97816091954023, 0.997701149425287, 1, 1))
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "(ref:hasler-auc-s) ",
caption = "(ref:hasler-auc-s) (ref:hasler-auc)")
format_design(k, df)
```
\acr{MCIA} was applied as a baseline of the integration and compared to the different models to know which one separates best colon and ileum samples.
The result on the first two dimension is shown in figure \@ref(fig:hasler-mcia).
```{r hasler-mcia, fig.scap="MCIA dimensions in the Häsler dataset.", fig.cap="MCIA dimensions in the Häsler dataset. MCIA first two dimensions of the dataset shows two vertical groups on the first syntethic dimension according to the location of the samples and colored by the state. Each point represents a sample (colored by location and shape by State)."}
include_graphics("images/hasler-mcia.png")
```
\acr{MCIA}'s \acr{AUC} results was as high as the model 2.2 to classify samples according to their location.
It was even better to classify the samples according to the type of sample they are: 0.6248 vs 1 the best \acr{AUC} from \acr{RGCCA} that corresponds to model 1.2.
\FloatBarrier
### Morgan's dataset {#results-morgan}
To explore this dataset that is different from the others with \acr{IBD}.
The \acr{PCA} didn't show any pattern on the microbiome according to the location, see figure \@ref(fig:morgan-pca-16s).
```{r morgan-pca-16s, fig.scap="PCA of 16S in the Morgan dataset", fig.cap="PCA of 16S in the Morgan dataset. There seems to be quite a diverse microbiota based on the first dimension. (ref:poins-col-shape)"}
include_graphics("images/PCA_16S_morgan.png")
```
The \acr{PCA} on the transcriptome did not show a clear distinction between the \acr{PPI} and the pouch in figure \@ref(fig:morgan-pca-RNAseq) but there is a pattern.
```{r morgan-pca-RNAseq, fig.scap="PCA of RNAseq in the Morgan dataset", fig.cap="PCA of RNAseq in the Morgan dataset. There is no clear separation of the two locations on the first dimensions of the PCA. (ref:poins-col-shape)"}
include_graphics("images/PCA_RNASeq_morgan.png")
```
We tested if results of `inteRmodel` were consistent on this dataset with the other datasets.
The first model we tried is model 0 as in table \@ref(tab:morgan-model0).
```{r morgan-model0}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 0" = c("Transcriptome", "Microbiome"),
Transcriptome = c(0L, 1L),
Microbiome = c(1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 0 of the Morgan dataset.",
caption = "Relationships between the different blocks in the Morgan dataset for model 0. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome and microbiome are linked.")
format_design(k, df)
```
We then added the data about the samples as provided \@ref(methods-morgan), on a simple model as in table \@ref(tab:morgan-model1).
```{r morgan-model1}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 1" = c("Transcriptome", "Microbiome", "metadata"),
Transcriptome = c(0L, 0L, 1L),
Microbiome = c(0L, 0L, 1L),
metadata = c(1L, 1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 1 of the Morgan dataset.",
caption = "Relationships between the different blocks in the Morgan dataset for model 1. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome and microbiome are related to the metadata.")
format_design(k, df)
```
When looking for the model that adjust better following this structure we arrived to model 1.2, described in table \@ref(tab:morgan-model1-2).
(ref:morgan-model1-2) Model 1.2 of the Morgan dataset.
```{r morgan-model1-2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 1.2" = c("Transcriptome", "Microbiome", "metadata"),
Transcriptome = c(0, 0.1, 0),
Microbiome = c(0.1, 0, 1),
metadata = c(0L, 1L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "(ref:morgan-model1-2)",
caption = "(ref:morgan-model1-2) Relationships between the different blocks in the Morgan dataset for model 1.2. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome is related to the microbiome but weakly and the microbiome is also related to the metadata.")
format_design(k, df)
```
On model two we split the invariable variables from those related to the location (see table \@ref(tab:morgan-model2)).
```{r morgan-model2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 2" = c("Transcriptome",
"Microbiome","Demographic","Location"),
Transcriptome = c(0L, 1L, 1L, 1L),
Microbiome = c(1L, 0L, 1L, 1L),
Demographic = c(1L, 1L, 0L, 0L),
Location = c(1L, 1L, 0L, 0L),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2 of the Morgan dataset.",
caption = "Relationships between the different blocks in the Morgan dataset for model 2. 0 indicates no relationship and 1 indicates a strong relationship. Transcriptome and the microbiome are linked and related to demographic and location variables.")
format_design(k, df, out = "scale_down")
```
The model that has higher inner \acr{AVE} for these blocks is the following:
```{r morgan-model2-2}
df <- data.frame(
stringsAsFactors = FALSE,
"Model 2.2" = c("Transcriptome",
"Microbiome","Demographic","Location"),
Transcriptome = c(0, 0.1, 1, 0.1),
Microbiome = c(0.1, 0, 0, 0),
Demographic = c(1L, 0L, 0L, 0L),
Location = c(0.1, 0, 0, 0),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 2.2 of the Morgan dataset.",
caption = "Relationships between the different blocks in the Morgan dataset for model 2.2. 0 indicates no relationship and 1 indicates a strong relationship. The transcriptome is weakly related to the microbiome and location but much stronger related to the demographic variables.")
format_design(k, df, out = "scale_down")
```
Each model is different from previous models.
After model 2.2 we looked on the model similar to model 2.3 in the \acr{HSCT} dataset showed but it is the same as in model 2.2.
However, it is kept on the further analysis.
The different models were not able to separate the samples neither by location or sex.
```{r morgan-models, fig.scap="Models from inteRmodel in the Morgan dataset. ", fig.cap="Models from inteRmodel in the Morgan dataset. First component of the transcriptome and microbiome of models on the Morgan dataset. Model 0 without sample data. Model 1 to 1.2 with all the sample data in a single block and models 2.1 to 2.3 with sample data in several blocks. Panel A shows samples colored by sex and panel B by segment of the sample. There is no clear classification neither by location nor sex on any of the models."}
include_graphics("images/morgan_models.png")
```
Nevertheless, we compared the classification with the \acr{MCIA} algorithm and still resulted that model 2.2 provide a better classification than \acr{MCIA}.
```{r morgan-models-ave}
df <- data.frame(
Model = c(0, 1, 1.1, 1.2, 2, 2.1, 2.2, 2.3),
"inner AVE" = c(0.4735601,0.6333592,0.2448234,
0.7868443,0.4404123,0.6052598,0.6895661,0.6895661),
"outer AVE" = c(0.1098639,0.115228,0.1104746,
0.042266,0.108873,0.10749,0.1081315,0.1081315),
check.names = FALSE
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "AVE values of RGCCA for the Morgan dataset.",
caption = "AVE values of RGCCA for the Morgan dataset. The inner and the outer AVE scores of multiple models tested on the Morgan dataset are shown. The model with the highest inner AVE is model 1.2 followed by 2.2.")
format_design(k, df)
```
When exploring the bootstraps of the data we found that model 1.2 is highly variable:
```{r morgan-bootstrap, fig.scap="AVE scores of bootstrapped models from Morgan dataset.", fig.cap="AVE scores of bootstrapped models from Morgan dataset. Inner and outer AVE scores of the bootstrapped models 0, 1.2 and 2.2 on the Morgan dataset. Model 0 does not have sample data. Model 1.2 has microbiome, transcriptome and sample data in a single block and model 2.2 has microbiome, transcriptome and the sample data split in several blocks. Model 2.2 shows less variance than all models but lower inner values than model 1.2. (ref:points-boot) The dispersion is shown by the ellipses."}
include_graphics("images/morgan_bootstrap.png")
```
In addition the model 2.2 (table \@ref(tab:morgan-model2-2)) usually has a lower inner \acr{AVE} compared to model 1.2 (table \@ref(tab:morgan-model2)).
```{r morgan-auc-plot, fig.scap="AUC of the RGCCA models in the Morgan dataset", fig.cap="AUC of the RGCCA models in the Morgan dataset. The classification of the localization of the sample according to the first component of the gene expression of the models generated with RGCCA on the Morgan dataset."}
include_graphics("images/morgan-auc.png")
```
The area under the curve for these models are:
```{r morgan-auc}
df <- data.frame(Model = c("0", "1", "1.1", "1.2", "2", "2.1", "2.2", "2.3"),
AUC = c(0.496973365617433, 0.793497059840886, 0.816153580076098, 0.560619162919405,
0.85463507436873, 0.847371151850571, 0.835264614320304, 0.835264614320304
))
k <- kable(df, align = rep("c", ncol(df)),
caption.short="AUC for the Morgan dataset",
caption="AUC for the Morgan dataset classificating the localization of the sample according to the first component of the gene expression of the models generated with RGCCA.")
format_design(k, df)
```
With \acr{MCIA} we can observe that the location does not separate the samples (see figure \@ref(fig:morgan-mcia-plot)).
```{r morgan-mcia-plot, fig.scap="MCIA dimensions in the Morgan dataset.", fig.cap="MCIA dimensions in the Morgan dataset. MCIA first two dimensions of the dataset shows some separation between samples by location. (ref:poins-col-shape)"}
include_graphics("images/morgan-mcia.png")
```
If we quantify this separation by the first dimension of \acr{MCIA}, the \acr{AUC} is 0.818, which is slightly worst than the models of family 2.
\FloatBarrier
### Howell's dataset {#results-howell}
The \acr{16S} of this dataset doesn't show any clear pattern regarding the location of the samples according to the firsts dimensions of the \acr{PCA} on figure \@ref(fig:howell-pca-16s).
```{r howell-pca-16s, fig.scap="PCA of 16S data from the Howell dataset", fig.cap = "PCA of 16S data from the Howell dataset. The first component separates the samples by their location. (ref:poins-col-shape)"}
include_graphics("images/howell_PCA_16S.png")
```
The \acr{PCA} on the transcriptome shows a distinction between colon and ileum, see figure \@ref(fig:howell-pca-RNAseq), there are almost two distinct groups according to location.
```{r howell-pca-RNAseq, fig.scap="PCA of RNAseq data from the Howell dataset", fig.cap = "PCA of RNAseq data from the Howell dataset. There are two groups of samples according to their location. (ref:poins-col-shape)"}
include_graphics("images/howell_PCA_RNAseq.png")
```
This dataset was processed to confirm the results on the previous datasets.
As always first we started with model 0, see table \@ref(tab:howell-model0), connecting both the RNAseq and the \acr{16S} blocks.
```{r howell-model0}
df <- data.frame(
stringsAsFactors = FALSE,
check.names = FALSE,
`Model 0` = c("RNAseq", "16S"),
RNAseq = c(0L, 1L),
`16S` = c(1L, 0L)
)
k <- kable(df, align = rep("c", ncol(df)),
caption.short = "Model 0 of the Howell dataset.",
caption = "Relationships between the different blocks in the Howell dataset for model 0. 0 indicates no relationship and 1 indicates a strong relationship.")
format_design(k, df)
```
Later we look for the best model of family 1 (without looking at any previous model of family 1).
This resulted on the following model in table \@ref(tab:howell-model1-2).
```{r howell-model1-2}
df <- data.frame(
stringsAsFactors = FALSE,
check.names = FALSE,
`Model 1.2` = c("RNAseq", "16S", "metadata"),
RNAseq = c(0, 0.1, 1),
`16S` = c(0.1, 0, 0),
metadata = c(1L, 0L, 0L)
)