forked from Bioconductor/BiocWorkshops
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path210_Geistlinger_Omics.Rmd
826 lines (621 loc) · 31.6 KB
/
210_Geistlinger_Omics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
# 210: Functional enrichment analysis of high-throughput omics data
## Instructor names and contact information
* Ludwig Geistlinger (<Ludwig.Geistlinger@sph.cuny.edu>)
* Levi Waldron
CUNY School of Public Health
55 W 125th St, New York, NY 10027
## Workshop Description
This workshop gives an in-depth overview of existing methods for enrichment
analysis of gene expression data with regard to functional gene sets, pathways,
and networks.
The workshop will help participants understand the distinctions between
assumptions and hypotheses of existing methods as well as the differences in
objectives and interpretation of results.
It will provide code and hands-on practice of all necessary steps for differential
expression analysis, gene set- and network-based enrichment analysis, and
identification of enriched genomic regions and regulatory elements, along with
visualization and exploration of results.
### Pre-requisites
* Basic knowledge of R syntax
* Familiarity with the SummarizedExperiment class
* Familiarity with the GenomicRanges class
* Familiarity with high-throughput gene expression data as obtained with
microarrays and RNA-seq
* Familiarity with the concept of differential expression analysis
(with e.g. limma, edgeR, DESeq2)
### Workshop Participation
Execution of example code and hands-on practice
### _R_ / _Bioconductor_ packages used
* [EnrichmentBrowser](http://bioconductor.org/packages/EnrichmentBrowser)
* [regioneR](http://bioconductor.org/packages/regioneR)
* [airway](http://bioconductor.org/packages/airway)
* [ALL](http://bioconductor.org/packages/ALL)
* [hgu95av2.db](http://bioconductor.org/packages/hgu95av2.db)
* [BSgenome.Hsapiens.UCSC.hg19.masked](http://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19.masked)
### Time outline
| Activity | Time |
|---------------------------------------|------|
| Background | 30m |
| Differential expression analysis | 15m |
| Gene set analysis | 30m |
| Gene network analysis | 15m |
| Genomic region analysis | 15m |
## Goals and objectives
Theory:
* Gene sets, pathways & regulatory networks
* Resources
* Gene set analysis vs. gene set enrichment analysis
* Underlying null: competitive vs. self-contained
* Generations: ora, fcs & topology-based
Practice:
* Data types: microarray vs. RNA-seq
* Differential expression analysis
* Defining gene sets according to GO and KEGG
* GO/KEGG overrepresentation analysis
* Functional class scoring & permutation testing
* Network-based enrichment analysis
* Genomic region enrichment analysis
## Where does it all come from?
Test whether known biological functions or processes are over-represented
(= enriched) in an experimentally-derived gene list, e.g. a list of
differentially expressed (DE) genes. See
[Goeman and Buehlmann, 2007](https://doi.org/10.1093/bioinformatics/btm051) for
a critical review.
Example: Transcriptomic study, in which 12,671 genes have been tested for
differential expression between two sample conditions and 529 genes were found
DE.
Among the DE genes, 28 are annotated to a specific functional gene set, which
contains in total 170 genes. This setup corresponds to a 2x2 contingency table,
```{R}
deTable <-
matrix(c(28, 142, 501, 12000),
nrow = 2,
dimnames = list(c("DE", "Not.DE"),
c("In.gene.set", "Not.in.gene.set")))
deTable
```
where the overlap of 28 genes can be assessed based on the hypergeometric distribution.
This corresponds to a one-sided version of Fisher's exact test, yielding here a
significant enrichment.
```{R}
fisher.test(deTable, alternative = "greater")
```
This basic principle is at the foundation of major public and commercial enrichment
tools such as [DAVID](https://david.ncifcrf.gov/) and
[Pathway Studio](https://www.pathwaystudio.com).
Although gene set enrichment methods have been primarily developed and applied
on transcriptomic data, they have recently been modified, extended and applied
also in other fields of genomic and biomedical research. This includes novel
approaches for functional enrichment analysis of proteomic and metabolomic data
as well as genomic regions and disease phenotypes,
[Lavallee and Yates, 2016](https://doi.org/10.1002/0471250953.bi1328s53),
[Chagoyen et al., 2016](https://doi.org/10.1007/978-1-4939-3572-7_20),
[McLean et al., 2010](https://doi.org/10.1038/nbt.1630),
[Ried et al., 2012](https://doi.org/10.1002/gepi.21617).
## Gene expression-based enrichment analysis
The first part of the workshop is largely based on the
[EnrichmentBrowser](http://bioconductor.org/packages/EnrichmentBrowser)
package, which implements an analysis pipeline
for high-throughput gene expression data as measured with microarrays and
RNA-seq. In a workflow-like manner, the package brings together a selection of
established Bioconductor packages for gene expression data analysis. It
integrates a wide range of gene set enrichment analysis methods and facilitates
combination and exploration of results across methods.
<img src="https://github.com/waldronlab/BrownCOBRE2018/blob/master/notebooks_day2/EnrichmentBrowserWorkflow.png?raw=true", alt="EnrichmentBrowserWorkflow image", style="width:650px">
```{R}
suppressPackageStartupMessages(library(EnrichmentBrowser))
```
Further information can be found in the
[vignette](http://www.bioconductor.org/packages/release/bioc/vignettes/EnrichmentBrowser/inst/doc/EnrichmentBrowser.pdf)
and [publication](https://doi.org/10.1186/s12859-016-0884-1).
## A primer on terminology, existing methods & statistical theory
**Gene sets, pathways & regulatory networks**
Gene sets are simple lists of usually functionally related genes without further
specification of relationships between genes.
Pathways can be interpreted as specific gene sets, typically representing a
group of genes that
work together in a biological process. Pathways are commonly divided in
metabolic and signaling pathways.
Metabolic pathways such as glycolysis represent biochemical substrate conversions
by specific enzymes. Signaling pathways such as the MAPK signaling pathway describe
signal transduction cascades from receptor proteins to transcription factors,
resulting in activation or inhibition of specific target genes.
Gene regulatory networks describe the interplay and effects of regulatory
factors (such as transcription factors and microRNAs) on the expression of their
target genes.
**Resources**
[GO](http://www.geneontology.org) and [KEGG](http://www.genome.jp/kegg)
annotations are most frequently used for the enrichment analysis of
functional gene sets. Despite an increasing number of gene set and pathway
databases, they are typically the first choice due to their long-standing
curation and availability for a wide range of species.
*GO*: The Gene Ontology (GO) consists of three major sub-ontologies that classify
gene products according to molecular function (MF), biological process (BP) and
cellular component (CC). Each ontology consists of GO terms that define MFs, BPs
or CCs to which specific genes are annotated. The terms are organized in a
directed acyclic graph, where edges between the terms represent
relationships of different types. They relate the terms according to a parent-child
scheme, i.e. parent terms denote more general entities, whereas child terms represent
more specific entities.
*KEGG*: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of
manually drawn pathway maps representing molecular interaction and reaction networks.
These pathways cover a wide range of biochemical processes that can be divided in
7 broad categories: metabolism, genetic and environmental information processing,
cellular processes, organismal systems, human diseases, and drug development.
Metabolism and drug development pathways differ from pathways of the other 5
categories by illustrating reactions between chemical compounds.
Pathways of the other 5 categories illustrate molecular interactions between
genes and gene products.
**Gene set analysis vs. gene set enrichment analysis**
The two predominantly used enrichment methods are:
- Overrepresentation analysis (ORA), testing whether a gene set contains
disproportional many genes of significant expression change, based on the
procedure outlined in the first section
- Gene set enrichment analysis (GSEA), testing whether genes of a gene set
accumulate at the top or bottom of the full gene vector ordered by direction
and magnitude of expression change
[Subramanian et al., 2005](https://doi.org/10.1073/pnas.0506580102)
However, the term *gene set enrichment analysis* nowadays subsumes a general
strategy implemented by a wide range of methods
[Huang et al., 2009](https://doi.org/10.1093/nar/gkn923).
Those methods have in common the same goal, although approach and statistical
model can vary substantially
[Goeman and Buehlmann, 2007](https://doi.org/10.1093/bioinformatics/btm051),
[Khatri et al., 2012](https://doi.org/10.1371/journal.pcbi.1002375).
To better distinguish from the specific method, some authors use the term
*gene set analysis* to denote the general strategy.
However, there is also a specific method from
[Efron and Tibshirani, 2007](https://doi.org/10.1214/07-AOAS101) of this name.
**Underlying null: competitive vs. self-contained**
[Goeman and Buehlmann, 2007](https://doi.org/10.1093/bioinformatics/btm051)
classified existing enrichment methods into *competitive* and *self-contained*
based on the underlying null hypothesis.
- *Competitive* null hypothesis: the genes in the set of interest are at most as
often DE as the genes not in the set,
- *Self-contained* null hypothesis: no genes in the set of interest are DE.
Although the authors argue that a self-contained null is closer to the actual
question of interest, the vast majority of enrichment methods is competitive.
Goeman and Buehlmann further raise several critical issues concerning the 2x2 ORA:
- rather arbitrary classification of genes in DE / not DE
- based on gene sampling, although sampling of subjects is appropriate
- unrealistic independence assumption between genes, resulting in highly
anti-conservative *p*-values
With regard to these statistical concerns, GSEA is considered superior:
- takes all measured genes into account
- subject sampling via permutation of class labels
- the incorporated permutation procedure implicitly accounts for correlations
between genes
However, the simplicity and general applicability of ORA is unmet by subsequent
methods improving on these issues. For instance, GSEA requires the expression data
as input, which is not available for gene lists derived from other experiment types.
On the other hand, the involved sample permutation procedure has been proven
inaccurate and time-consuming
[Efron and Tibshirani, 2007](https://doi.org/10.1214/07-AOAS101),
[Phipson and Smyth, 2010](https://doi.org/10.2202/1544-6115.1585),
[Larson and Owen, 2015](https://doi.org/10.1186/s12859-015-0571-7).
**Generations: ora, fcs & topology-based**
[Khatri et al., 2012](https://doi.org/10.1371/journal.pcbi.1002375) have taken a
slightly different approach by classifying methods along the timeline of
development into three generations:
1. Generation: ORA methods based on the 2x2 contingency table test,
2. Generation: functional class scoring (FCS) methods such as GSEA, which compute
gene set (= functional class) scores by summarizing per-gene DE statistics,
3. Generation: topology-based methods, explicitly taking into account interactions
between genes as defined in signaling pathways and gene regulatory networks
([Geistlinger et al., 2011](https://doi.org/10.1093/bioinformatics/btr228) for an example).
Although topology-based (also: network-based) methods appear to be most realistic,
their straightforward application can be impaired by features that are not-detectable
on the transcriptional level (such as protein-protein interactions) and insufficient network knowledge
[Geistlinger et al., 2013](https://doi.org/10.1093/nar/gkt631),
[Bayerlova et al., 2015](https://doi.org/10.1186/s12859-015-0751-5).
Given the individual benefits and limitations of existing methods,
cautious interpretation of results is required to derive valid conclusions.
Whereas no single method is best suited for all application scenarios, applying
multiple methods can be beneficial.
This has been shown to filter out spurious hits of individual methods, thereby
reducing the outcome to gene sets accumulating evidence from different methods
[Geistlinger et al., 2016](https://doi.org/10.1186/s12859-016-0884-1),
[Alhamdoosh et al., 2017](https://doi.org/10.1093/bioinformatics/btw623).
## Data types
Although RNA-seq (read count data) has become the *de facto* standard for
transcriptomic profiling, it is important to know that many methods for
differential expression and gene set enrichment analysis have been originally
developed for microarray data (intensity measurements).
However, differences in data distribution assumptions (microarray: quasi-normal,
RNA-seq: negative binomial) made adaptations in differential expression analysis
and, to some extent, also in gene set enrichment analysis necessary.
Thus, we consider two example datasets - a microarray and a RNA-seq dataset,
and discuss similarities and differences of the respective analysis steps.
For microarray data, we consider expression measurements of patients with acute
lymphoblastic leukemia
[Chiaretti et al., 2004](https://doi.org/10.1182/blood-2003-09-3243). A
frequent chromosomal defect found among these patients is a translocation, in
which parts of chromosome 9 and 22 swap places. This results in the oncogenic
fusion gene BCR/ABL created by positioning the ABL1 gene on chromosome 9 to a
part of the BCR gene on chromosome 22.
We load the
[ALL](http://bioconductor.org/packages/ALL)
dataset
```{R}
library(ALL)
data(ALL)
```
and select B-cell ALL patients with and without the BCR/ABL fusion, as described previously
[Gentleman et al., 2005](https://www.bioconductor.org/help/publications/books/bioinformatics-and-computational-biology-solutions).
```{R}
ind.bs <- grep("^B", ALL$BT)
ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG"))
sset <- intersect(ind.bs, ind.mut)
all.eset <- ALL[, sset]
```
We can now access the expression values, which are intensity measurements
on a log-scale for 12,625 probes (rows) across 79 patients (columns).
```{R}
dim(all.eset)
exprs(all.eset)[1:4,1:4]
```
As we often have more than one probe per gene, we compute gene expression values
as the average of the corresponding probe values.
```{R}
allSE <- probe2gene(all.eset)
head(names(allSE))
```
For RNA-seq data, we consider transcriptome profiles of four primary human
airway smooth muscle cell lines in two conditions: control and treatment with
dexamethasone
[Himes et al., 2014](https://doi.org/10.1371/journal.pone.0099625).
We load the
[airway](http://bioconductor.org/packages/airway)
dataset
```{R}
library(airway)
data(airway)
```
For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.
```{R}
airSE <- airway[grep("^ENSG", names(airway)), ]
dim(airSE)
```
```{R}
assay(airSE)[1:4,1:4]
```
## Differential expression analysis
Normalization of high-throughput expression data is essential to make results
within and between experiments comparable. Microarray (intensity measurements)
and RNA-seq (read counts) data typically show distinct features that need to be
normalized for. As this is beyond the scope of this workshop, we refer to
[limma](http://bioconductor.org/packages/limma)
for microarray normalization and
[EDASeq](http://bioconductor.org/packages/EDASeq)
for RNA-seq normalization. See also `EnrichmentBrowser::normalize`, which wraps
commonly used functionality for normalization.
The EnrichmentBrowser incorporates established functionality from the
[limma](http://bioconductor.org/packages/limma)
package for differential expression analysis.
This involves the `voom` transformation when applied to RNA-seq data.
Alternatively, differential expression analysis for RNA-seq data can also be
carried out based on the negative binomial distribution with
[edgeR](http://bioconductor.org/packages/edgeR)
and
[DESeq2](http://bioconductor.org/packages/DESeq2).
This can be performed using the function `EnrichmentBrowser::deAna`
and assumes some standardized variable names:
- **GROUP** defines the sample groups being contrasted,
- **BLOCK** defines paired samples or sample blocks, as e.g. for batch effects.
For more information on experimental design, see the
[limma user's guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf),
chapter 9.
For the ALL dataset, the **GROUP** variable indicates whether the BCR-ABL gene
fusion is present (1) or not (0).
```{R}
allSE$GROUP <- ifelse(allSE$mol.biol == "BCR/ABL", 1, 0)
table(allSE$GROUP)
```
For the airway dataset, it indicates whether the cell lines have been treated
with dexamethasone (1) or not (0).
```{R}
airSE$GROUP <- ifelse(colData(airway)$dex == "trt", 1, 0)
table(airSE$GROUP)
```
Paired samples, or in general sample batches/blocks, can be defined via a
`BLOCK` column in the `colData` slot. For the airway dataset, the sample blocks
correspond to the four different cell lines.
```{R}
airSE$BLOCK <- airway$cell
table(airSE$BLOCK)
```
For microarray data, the `EnrichmentBrowser::deAna` function carries out
differential expression analysis based on functionality from the
*limma* package. Resulting log2 fold changes and *t*-test derived
*p*-values for each gene are appended to the `rowData` slot.
```{R}
allSE <- deAna(allSE)
rowData(allSE, use.names=TRUE)
```
Nominal *p*-values are already corrected for multiple testing (`ADJ.PVAL`)
using the method from Benjamini and Hochberg implemented in `stats::p.adjust`.
For RNA-seq data, the `deAna` function can be used to carry out differential
expression analysis between the two groups either based on functionality from
*limma* (that includes the `voom` transformation), or
alternatively, the frequently used *edgeR* or *DESeq2*
package. Here, we use the analysis based on *edgeR*.
```{R}
airSE <- deAna(airSE, de.method="edgeR")
```
```{R}
rowData(airSE, use.names=TRUE)
```
*Exercise:* Compare the number of differentially expressed genes as obtained on the `airSE` with `limma/voom`, `edgeR`, and `DESeq2`.
## Gene sets
We are now interested in whether pre-defined sets of genes that are known to
work together, e.g. as defined in the [Gene Ontology](http://www.geneontology.org)
or the [KEGG](http://www.genome.jp/kegg) pathway annotation, are coordinately
differentially expressed.
The function `getGenesets` can be used to download gene sets from databases such
as GO and KEGG.
Here, we use the function to download all KEGG pathways for a chosen organism
(here: \emph{Homo sapiens}) as gene sets.
```{R}
kegg.gs <- getGenesets(org="hsa", db="kegg")
```
Analogously, the function `getGenesets` can be used to retrieve GO terms of a
selected ontology (here: biological process, BP) as defined in the *GO.db*
annotation package.
```{R}
go.gs <- getGenesets(org="hsa", db="go", go.onto="BP", go.mode="GO.db")
```
If provided a file, the function `getGenesets` parses user-defined gene sets
from GMT file format.
Here, we use this functionality for reading a list of already downloaded
KEGG gene sets for *Homo sapiens* containing NCBI Entrez Gene IDs.
```{R}
data.dir <- system.file("extdata", package="EnrichmentBrowser")
gmt.file <- file.path(data.dir, "hsa_kegg_gs.gmt")
hsa.gs <- getGenesets(gmt.file)
hsa.gs[1:2]
```
```{R}
hsa.gs[1:2]
```
See also the [MSigDb](http://software.broadinstitute.org/gsea/msigdb) for
additional gene set collections.
## GO/KEGG overrepresentation analysis
A variety of gene set analysis methods have been proposed
[Khatri et al., 2012](https://doi.org/10.1371/journal.pcbi.1002375).
The most basic, yet frequently used, method is the over-representation analysis
(ORA) with gene sets defined according to GO or KEGG.
As outlined in the first section, ORA tests the overlap between DE genes
(typically DE *p*-value < 0.05) and genes in a gene set based on the
hypergeometric distribution.
Here, we choose a significance level $\alpha = 0.2$ for demonstration.
```{R}
ora.all <- sbea(method="ora", se=allSE, gs=hsa.gs, perm=0, alpha=0.2)
gsRanking(ora.all)
```
Such a ranked list is the standard output of most existing enrichment tools.
Using the `eaBrowse` function creates a HTML summary from which each
gene set can be inspected in more detail.
<img src="https://github.com/waldronlab/BrownCOBRE2018/blob/master/notebooks_day2/EnrichmentBrowserNavigation.png?raw=true", alt="EnrichmentBrowserNavigation image", style="width:600px">
```{R}
eaBrowse(ora.all)
```
The resulting summary page includes for each significant gene set
- a gene report, which lists all genes of a set along with fold change and DE
$p$-value (click on links in column `NR.GENES`),
- interactive overview plots such as heatmap and volcano plot (column
`SET.VIEW`, supports mouse-over and click-on),
- for KEGG pathways: highlighting of differentially expressed genes on the
pathway maps (column `PATH.VIEW`, supports mouse-over and click-on).
As ORA works on the list of DE genes and not the actual expression values, it
can be straightforward applied to RNA-seq data. However, as the gene sets here
contain NCBI Entrez gene IDs and the airway dataset contains ENSEMBL gene ids,
we first map the airway dataset to Entrez IDs.
```{R}
airSE <- idMap(airSE, org="hsa", from="ENSEMBL", to="ENTREZID")
```
```{R}
ora.air <- sbea(method="ora", se=airSE, gs=hsa.gs, perm=0)
gsRanking(ora.air)
```
Note #1: [Young et al., 2010](https://doi.org/10.1186/gb-2010-11-2-r14), have
reported biased results for ORA on RNA-seq data due to over-detection of
differential expression for long and highly expressed transcripts. The
[goseq](http://bioconductor.org/packages/goseq)
package and `limma::goana` implement possibilities to adjust ORA for gene length
and abundance bias.
Note #2: Independent of the expression data type under investigation, overlap
between gene sets can result in redundant findings. This is well-documented for
GO (parent-child structure,
[Rhee et al., 2008](https://doi.org/10.1038/nrg2363)) and KEGG (pathway
overlap/crosstalk,
[Donato et al., 2013](https://doi.org/10.1101/gr.153551.112)). The
[topGO](http://bioconductor.org/packages/topGO)
package (explicitly designed for GO) and
[mgsa](http://bioconductor.org/packages/mgsa)
(applicable to arbitrary gene set definitions) implement
modifications of ORA to account for such redundancies.
## Functional class scoring & permutation testing
A major limitation of ORA is that it restricts analysis to DE genes, excluding
genes not satisfying the chosen significance threshold (typically the vast
majority).
This is resolved by gene set enrichment analysis (GSEA), which scores the
tendency of gene set members to appear rather at the top or bottom of the
ranked list of all measured genes
[Subramanian et al., 2005](https://doi.org/10.1073/pnas.0506580102). The
statistical significance of the enrichment score (ES) of a gene set is assessed
via sample permutation, i.e. (1) sample labels (= group assignment) are
shuffled, (2) per-gene DE statistics are recomputed, and (3) the enrichment
score is recomputed. Repeating this procedure many times allows to determine
the empirical distribution of the enrichment score and to compare the observed
enrichment score against it. Here, we carry out GSEA with 1000 permutations.
```{R}
gsea.all <- sbea(method="gsea", se=allSE, gs=hsa.gs, perm=1000)
```
```{R}
gsRanking(gsea.all)
```
As GSEA's permutation procedure involves re-computation of per-gene DE
statistics, adaptations are necessary for RNA-seq. The EnrichmentBrowser
implements an accordingly adapted version of GSEA, which allows incorporation
of limma/voom, edgeR, or DESeq2 for repeated DE re-computation within GSEA.
However, this is computationally intensive (for limma/voom the least, for
DESeq2 the most). Note the relatively long running times for only 100
permutations having used edgeR for DE analysis.
```{R}
gsea.air <- sbea(method="gsea", se=airSE, gs=hsa.gs, perm=100)
```
While it might be in some cases necessary to apply permutation-based GSEA for
RNA-seq data, there are also alternatives avoiding permutation. Among them is
ROtAtion gene Set Testing (ROAST), which uses rotation instead of permutation
[Wu et al., 2010](https://doi.org/10.1093/bioinformatics/btq401).
```{R}
roast.air <- sbea(method="roast", se=airSE, gs=hsa.gs)
gsRanking(roast.air)
```
A selection of additional methods is also available:
```{R}
sbeaMethods()
```
*Exercise*: Carry out a GO overrepresentation analysis for the `allSE` and `airSE`. How many significant gene sets do you observe in each case?
## Network-based enrichment analysis
Having found gene sets that show enrichment for differential expression,
we are now interested whether these findings can be supported by known
regulatory interactions.
For example, we want to know whether transcription factors and their target
genes are expressed in accordance to the connecting regulations
(activation/inhibition).
Such information is usually given in a gene regulatory network derived from
specific experiments or compiled from the literature
([Geistlinger et al., 2013](https://doi.org/10.1093/nar/gkt631) for an example).
There are well-studied processes and organisms for which comprehensive and
well-annotated regulatory networks are available, e.g. the
[RegulonDB](http://regulondb.ccg.unam.mx) for *E. coli* and
[Yeastract](http://www.yeastract.com) for *S. cerevisiae*.
However, there are also cases where such a network is missing or at least
incomplete.
A basic workaround is to compile a network from regulations in pathway databases
such as KEGG.
```{R}
hsa.grn <- compileGRN(org="hsa", db="kegg")
head(hsa.grn)
```
Signaling pathway impact analysis (SPIA) is a network-based enrichment analysis
method, which is explicitly designed for KEGG signaling pathways
[Tarca et al., 2009](https://doi.org/ 10.1093/bioinformatics/btn577). The
method evaluates whether expression changes are propagated across the pathway
topology in combination with ORA.
```{R}
spia.all <- nbea(method="spia", se=allSE, gs=hsa.gs, grn=hsa.grn, alpha=0.2)
gsRanking(spia.all)
```
More generally applicable is gene graph enrichment analysis (GGEA), which
evaluates consistency of interactions in a given gene regulatory network with
the observed expression data
[Geistlinger et al., 2011](https://doi.org/10.1093/bioinformatics/btr228).
```{R}
ggea.all <- nbea(method="ggea", se=allSE, gs=hsa.gs, grn=hsa.grn)
gsRanking(ggea.all)
```
```{R}
nbeaMethods()
```
Note #1: As network-based enrichment methods typically do not involve sample
permutation but rather network permutation,
thus avoiding DE re-computation, they can likewise be applied to RNA-seq data.
Note #2: Given the various enrichment methods with individual benefits and
limitations, combining multiple methods can be beneficial, e.g. combined
application of a set-based and a network-based method. This has been shown to
filter out spurious hits of individual methods and to reduce the outcome to
gene sets accumulating evidence from different methods
[Geistlinger et al., 2016](https://doi.org/10.1186/s12859-016-0884-1),
[Alhamdoosh et al., 2017](https://doi.org/10.1093/bioinformatics/btw623).
The function `combResults` implements the straightforward combination of
results, thereby facilitating seamless comparison of results across methods.
For demonstration, we use the ORA and GSEA results for the ALL dataset from the
previous section:
```{R}
res.list <- list(ora.all, gsea.all)
comb.res <- combResults(res.list)
gsRanking(comb.res)
```
*Exercise:* Carry out `SPIA` and `GGEA` for the `airSE` and combine the results. How many gene sets are rendered significant by both methods?
## Genomic region enrichment analysis
Microarrays and next-generation sequencing are also widely applied for
large-scale detection of variable and regulatory genomic regions, e.g. single
nucleotide polymorphisms, copy number variations, and transcription factor
binding sites.
<img src="https://github.com/waldronlab/BrownCOBRE2018/blob/master/notebooks_day2/ENCODE.png?raw=true", alt="ENCODE image", style="width:600px">
Such experimentally-derived genomic region sets are raising similar questions
regarding functional enrichment as in gene expression data analysis.
Of particular interest is thereby whether experimentally-derived regions
overlap more (enrichment) or less (depletion) than expected by chance with
regions representing known functional features such as genes or
promoters.
The
[regioneR](http://bioconductor.org/packages/regioneR)
package implements a general framework for testing overlaps of genomic regions
based on permutation sampling.
This allows to repeatedly sample random regions from the genome, matching size
and chromosomal distribution of the region set under study.
By recomputing the overlap with the functional features in each permutation,
statistical significance of the observed overlap can be assessed.
```{R}
suppressPackageStartupMessages(library(regioneR))
```
To demonstrate the basic functionality of the package, we consider the overlap
of gene promoter regions and CpG islands in the human genome. We expect to find
an enrichment as promoter regions are known to be GC-rich. Hence, is the
overlap between CpG islands and promoters greater than expected by
chance?
We use the collection of CpG islands described in
[Wu et al., 2010](https://doi.org/10.1093/biostatistics/kxq005) and restrict
them to the set of canonical chromosomes 1-23, *X*, and *Y*.
```{R}
cpgHMM <- toGRanges("http://www.haowulab.org/software/makeCGI/model-based-cpg-islands-hg19.txt")
cpgHMM <- filterChromosomes(cpgHMM, chr.type="canonical")
cpgHMM <- sort(cpgHMM)
cpgHMM
```
Analogously, we load promoter regions in the *hg19* human genome assembly as
available from [UCSC](https://genome.ucsc.edu/):
```{R}
promoters <- toGRanges("http://gattaca.imppc.org/regioner/data/UCSC.promoters.hg19.bed")
promoters <- filterChromosomes(promoters, chr.type="canonical")
promoters <- sort(promoters)
promoters
```
To speed up the example, we restrict analysis to chromosomes 21 and 22. Note
that this is done for demonstration only. To make an accurate claim, the
complete region set should be used (which, however, runs considerably longer).
```{R}
cpg <- cpgHMM[seqnames(cpgHMM) %in% c("chr21", "chr22")]
prom <- promoters[seqnames(promoters) %in% c("chr21", "chr22")]
```
Now, we are applying an overlap permutation test with 100 permutations
(`ntimes=100`), while maintaining chromosomal distribution of the CpG island
region set (`per.chromosome=TRUE`). Furthermore, we use the option
`count.once=TRUE` to count an overlapping CpG island only once, even if it
overlaps with 2 or more promoters.
Note that we use 100 permutations for demonstration only.
To draw robust conclusions a minimum of 1000 permutations should be carried out.
```{R}
pt <- overlapPermTest(cpg, prom, genome="hg19", ntimes=100, per.chromosome=TRUE, count.once=TRUE)
pt
```
```{R}
summary(pt[[1]]$permuted)
```
The resulting permutation *p*-value indicates a significant enrichment. Out of
the 2859 CpG islands, 719 overlap with at least one
promoter. In contrast, when repeatedly drawing random regions matching the CpG
islands in size and chromosomal distribution, the mean number of overlapping
regions across permutations was 117.7
$\pm$ 11.8.
Note #1: The function `regioneR::permTest` allows to incorporate user-defined
functions for randomizing regions and evaluating additional measures of overlap
such as total genomic size in bp.
Note #2: The
[LOLA](http://bioconductor.org/packages/LOLA)
package implements a genomic region ORA, which assesses genomic region overlap
based on the hypergeometric distribution using a library of pre-defined functional
region sets.