-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathReports_communites_ES.Rmd
354 lines (269 loc) · 18 KB
/
Reports_communites_ES.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
---
title: "Deconvolution of gene expression Ewing Sarcoma datasets"
author: "Jane Merlevede & Andrei Zinovyev"
date: "`r format(Sys.time(), '%m/%d/%y')`"
output:
bookdown::html_document2:
toc: true
toc_float:
collapsed: true
smooth_scroll: true
toc_depth: 4
number_sections: true
df_print: paged
---
```{r variable_definition, echo=FALSE}
#Options to be defined by the user
tumor_type="ES"
nPerm=10000 #Nb permutations for gsea
top_top_contributing_genes=5 #Nb of top top-contributing genes to show in table
```
```{r file_loading, echo=FALSE}
#input files to provide
weighted_metagenes = read.table(paste0(tumor_type,"_weighted_metagenes"), header = TRUE, sep="\t") #weighted metagenes computed in mcl_and_weighted_metagenes.R
datasets_info = read.table("../datasets_info", header = TRUE, sep="\t", as.is=TRUE) #info about the datasets
community_comp = read.table(paste0(tumor_type,"_communities"), header = TRUE, sep="\t") #obtained from mcl_and_weighted_metagenes.R
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r package_loading, echo=FALSE, message = FALSE}
library("stringr")
library("knitr")
library("clusterProfiler")
library(GOSemSim)
library("enrichplot")
library("vroom")
library("DOSE")
library("ggplot2")
library("cowplot")
library("DT")
```
# Introduction
We described here the analyses performed on `r tumor_type` gene expression datasets. Our goal is to identify robust (across different datasets) mechanims active in `r tumor_type` tumors.
This report described in details the communities identified: their component names, their techno type (bulk or single), their data type (patient, PDX, cell line, controls) and the significantly enriched gene sets. It also briefly describes the main methodological steps of the approach.
The inputs needed are:
* weighted metagenes previously computed
* the community composition obtained after the computation of weighted metagenes
* information about the datasets
This work is in the scope of D3.1 iPC deliverable.
Here we focused on Ewing sarcoma, for which we gathered 20 datasets:
* 5 bulk datasets of patients
* 4 single cell cell lines
* 8 single cell PDX
* 3 single cell controls (including 2 full embryos GSE137804)
```{r dataset_infos, echo=FALSE}
datasets_info_ES = subset(datasets_info, tumor_type=="ES")
datasets_info_ctrl=subset(datasets_info, tumor_type=="Ctrl")
data_infos=rbind(datasets_info_ES, datasets_info_ctrl)
data_infos=data_infos[,-1]
kable(data_infos[,c(4,1:3)])
```
# Description of the communities
## Community composition
In `r tumor_type` datasets, `r nrow(community_comp)` communities containing at least 3 distinct datasets were identified. The number of components per community and the community composition are shown below.
```{r table_community_composition, echo=FALSE}
barplot(rowSums(!is.na(community_comp)), xlab="Community", ylab="Nb of components", main="Nb of components per community", names.arg = 1:nrow(community_comp))
datatable(community_comp)
```
Now, we want to characterize these communities by the nature of the components they contain and by the mechanims they represent.
## Nature of the communities
First, for each community, we describe how many components they contain and from how many unique datatsets they come from.
We also show how many bulk and single cell data were gathered. The sample type, patients, cell lines, PDX and controls is also given and is a first clue for the interpretation.
```{r datasets_info, echo=FALSE}
community_stats=as.data.frame(matrix(0, nrow = nrow(community_comp), ncol = 0))
community_stats$community=paste0("C",1:nrow(community_comp))
community_stats$nb_comp = rowSums(!is.na(community_comp))
for (c in 1:nrow(community_comp))
{
current_community = community_comp[c,!is.na(community_comp[c,])]
comp_name=rep(0,length(current_community))
for (k in 1:length(current_community))
{
comp_name[k] = unlist(str_split(current_community[1,k],"_metagene"))[1]
}
ind_comp_current_community = which(datasets_info$Acronym %in% comp_name)
datasets_info_current_community = datasets_info[ind_comp_current_community,]
community_stats$nb_comp_unique[c] = length(unique(comp_name))
community_stats$nb_bulk[c]= sum(datasets_info_current_community$Techno=="bulk")
community_stats$nb_sc[c]= sum(datasets_info_current_community$Techno=="single-cell")
community_stats$nb_patient[c] = sum(datasets_info_current_community$data_type=="Patient")
community_stats$nb_cellline[c] = sum(datasets_info_current_community$data_type=="cell-line")
community_stats$nb_PDX[c] = sum(datasets_info_current_community$data_type=="PDX")
community_stats$nb_control[c] = sum(datasets_info_current_community$data_type=="control")
}
kable(community_stats)
```
```{r datasets_info_to_mention_in_text, echo=FALSE, message=FALSE, warning=FALSE}
nb_bulk_only=length(which(community_stats$nb_sc == 0))
nb_sc_only=length(which(community_stats$nb_bulk == 0))
nb_no_control=length(which(community_stats$nb_control == 0))
```
`r nrow(community_stats)-nb_bulk_only-nb_sc_only` out of `r nrow(community_stats)` communities are a mixture of bulk and single cell data, `r nb_bulk_only` contains only bulk and `r nb_sc_only` contains only single cell datasets.
`r nb_no_control` out of `r nrow(community_stats)` communities does not contain any control. As we want to focuse on tumor related signals, we can look in particular at communities not containing one component from control datasets. Thus we further focuse on `r nb_no_control` communities, which are communities `r which(community_stats$nb_control == 0)`.
## Top-contributing genes
The top-contributing genes (genes with the hightest and smallest weights) are shown below for each component.
```{r table_contributing_genes, echo=FALSE, message=FALSE, warning=FALSE}
#for each weighted metagenes of interest, top-contributing genes are the genes beyond 3 SD from the mean
Mean=colMeans(weighted_metagenes, na.rm = TRUE)
SD=apply(weighted_metagenes,2,sd, na.rm = TRUE)
nb_pos_tail_3SD=nb_neg_tail_3SD=rep(0,ncol(weighted_metagenes))
scores_pos_tail_3SD=scores_neg_tail_3SD=rep(0,ncol(weighted_metagenes))
fn = paste0(tumor_type,"_top_top_contributing_genes")
if (file.exists(fn,showWarnings = FALSE)) {
#Delete file if it exists
file.remove(fn)
}
for (k in 1:ncol(weighted_metagenes))
{
ind_pos_tail_3SD=which(weighted_metagenes[,k]>Mean[k]+3*SD[k])
ind_neg_tail_3SD=which(weighted_metagenes[,k]<Mean[k]-3*SD[k])
genes_pos_tail= rownames(weighted_metagenes[ind_pos_tail_3SD,])
genes_neg_tail= rownames(weighted_metagenes[ind_neg_tail_3SD,])
genes=c(genes_pos_tail, genes_neg_tail)
scores=c(weighted_metagenes[ind_pos_tail_3SD,k], weighted_metagenes[ind_neg_tail_3SD,k])
top_contributing_genes=data.frame(cbind(paste0("C",k), genes, scores))
write.table(top_contributing_genes , fn, sep="\t", row.names = FALSE, append=TRUE, col.names=!file.exists( fn), quote=FALSE)
}
top_contributing_genes = read.table(fn, header = TRUE, sep="\t", as.is=TRUE)
table_top_contributing_genes=data.frame(Community=numeric(nrow(community_comp)), Driver_pos_tail=numeric(nrow(community_comp)), Driver_neg_tail=numeric(nrow(community_comp)))
for (i in 1:nrow(community_comp))
{
community = subset(top_contributing_genes, V1==paste0("C",i))
community_pos = community[which(community$scores>=0),]
community_neg = community[which(community$scores<0),]
community_pos = community_pos[order(community_pos$scores,decreasing=TRUE),]
community_neg = community_pos[order(community_neg$scores,decreasing=FALSE),]
table_top_contributing_genes$Community[i] <- paste0("C",i)
table_top_contributing_genes$Driver_pos_tail[i] = paste0(community_pos$genes[1:top_top_contributing_genes], collapse=";")
table_top_contributing_genes$Driver_neg_tail[i] = paste0(community_neg$genes[1:top_top_contributing_genes], collapse=";")
}
datatable(table_top_contributing_genes, rownames=FALSE)
```
# Annotation of the communities using GO, KEGG, MSigDB and DO
To decipher the mechnanisms active in the communities, we extracted the top-contributing genes in each community and performed over representation analysis (hypergeometric tests) as well as gene set enrichment analyses (Kolmogorov tests).
## Over representation analysis of the top-contributing genes and gene set enrichment analysis for each community
Now we can describe each component in details. We used GO, KEGG, MSigDB (http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) and DO ontology to annotate the communities. Results are represented in dotplots, cnetplots and emaplots for both drivers of the positive and negative tails using clusterProfiler (https://bookdown.org/yihui/rmarkdown/html-document.html). If the OR analysis had no enrichment in one or the two tails, no plot are reported.
We defined top-contributing genes as genes having a weight below and beyond 3 sd +/- mean of the weights for each community.
```{r communities_OR_details, echo=FALSE, message=FALSE, warning = FALSE,fig.fullwidth = TRUE, fig.width=12}
gcSample_pos = list()
gcSample_neg = list()
Mean=colMeans(weighted_metagenes, na.rm = TRUE)
SD=apply(weighted_metagenes,2,sd, na.rm = TRUE)
folder_MSigDB_gene_sets = "gene_sets/MSigDB/"
gene_sets = list.files(folder_MSigDB_gene_sets, full.names = TRUE)
egmt_pos_combine=data.frame(matrix(nrow=0,ncol=10))
egmt_neg_combine=data.frame(matrix(nrow=0,ncol=10))
d <- godata('org.Hs.eg.db', ont="BP")
out <- NULL
for (i in 1:nrow(community_comp)) #nrow(community_comp)
{
out <- c(out, knit_child('template_pathway_analysis.Rmd', quiet = TRUE))
}
```
`r paste(out, collapse='\n')`
## Overview of over representation analysis in top-contributing genes of all communities
We provide here an overview of the top-contributing genes coming from the positive and negative tails for all components for GO, KEGG and DO annotations.
```{r communities_overview, echo=FALSE, message = FALSE, warning = FALSE, fig.height=30, fig.width=40}
#, fig.cap=c("KEGG overrepresentation of positive tail", "KEGG overrepresentation of negative tail", "GO overrepresentation of positive tail", "GO overrepresentation of negative tail", "DO overrepresentation of positive tail", "DO overrepresentation of negative tail")
#,fig.fullwidth = TRUE
gcSample_pos = list()
gcSample_neg = list()
Mean=colMeans(weighted_metagenes, na.rm = TRUE)
SD=apply(weighted_metagenes,2,sd, na.rm = TRUE)
for (i in 1:nrow(community_comp)) #nrow(community_comp)
{
## feature 1: numeric vector
geneList <- weighted_metagenes[,i]
## feature 2: named vector
names(geneList) <- rownames(weighted_metagenes)
## feature 3: decreasing order
geneList <- sort(geneList, decreasing = TRUE)
#data(geneList, package="DOSE")
#define top-contributing genes
ind_pos_tail_3SD=which(weighted_metagenes[,i]>Mean[i]+3*SD[i])
ind_neg_tail_3SD=which(weighted_metagenes[,i]<Mean[i]-3*SD[i])
genes_pos_tail= rownames(weighted_metagenes[ind_pos_tail_3SD,])
genes_neg_tail= rownames(weighted_metagenes[ind_neg_tail_3SD,])
genes=c(genes_pos_tail, genes_neg_tail)
gene_pos_tail = weighted_metagenes[ind_pos_tail_3SD,i]
gene_neg_tail = weighted_metagenes[ind_neg_tail_3SD,i]
names(gene_pos_tail) = rownames(weighted_metagenes[ind_pos_tail_3SD,])
names(gene_neg_tail) = rownames(weighted_metagenes[ind_neg_tail_3SD,])
gene_pos_tail_temp = sort(gene_pos_tail, decreasing = TRUE)
gene_neg_tail_temp = sort(gene_neg_tail, decreasing = TRUE)
gene_pos_tail_list = bitr(names(gene_pos_tail_temp), fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
gene_neg_tail_list = bitr(names(gene_neg_tail_temp), fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
key_pos = paste0("C",i,"_p")
key_neg = paste0("C",i,"_n")
gcSample_pos[[ key_pos ]] <- gene_pos_tail_list$ENTREZID
gcSample_neg[[ key_pos ]] <- gene_neg_tail_list$ENTREZID
}
ck_pos <- compareCluster(geneCluster = gcSample_pos, fun = "enrichKEGG")
ck_neg <- compareCluster(geneCluster = gcSample_neg, fun = "enrichKEGG")
cGO_pos <- compareCluster(geneCluster = gcSample_pos, fun = "enrichGO", OrgDb = org.Hs.eg.db)
cDO_pos <- compareCluster(geneCluster = gcSample_pos, fun = "enrichDO")
cGO_neg <- compareCluster(geneCluster = gcSample_neg, fun = "enrichGO", OrgDb = org.Hs.eg.db)
cDO_neg <- compareCluster(geneCluster = gcSample_neg, fun = "enrichDO")
dotplot(cGO_pos,font.size=32, title="GO overrepresentation for the top-contributing genes of the positive tail")
cat('\n\n')
dotplot(cGO_neg,font.size=32, title="GO overrepresentation for the top-contributing genes of the negative tail")
cat('\n\n')
dotplot(ck_pos,font.size=32, title="KEGG overrepresentation for the top-contributing genes of the positive tail")
cat('\n\n')
dotplot(ck_neg,font.size=32, title="KEGG overrepresentation for the top-contributing genes of the negative tail")
cat('\n\n')
dotplot(cDO_pos,font.size=32, title="DO overrepresentation for the top-contributing genes of the positive tail")
cat('\n\n')
dotplot(cDO_neg,font.size=32, title="DO overrepresentation for the top-contributing genes of the negative tail")
```
# Annotation of the communities using gene sets of transcription factors
We used Dorothea database (Discriminant Regulon Expression Analysis, https://saezlab.github.io/dorothea/, Garcia-Alonso L et al. 2019) to construct gene sets of each of the reported transcription factor. We then checked if the communities were enriched in some of these TF gene sets.
For that, we performed over representation analysis of the top-contributing genes of the positive and negative tails of each community using the TF gene sets.
```{r communities_OR_TF_gene_sets, echo=FALSE, message=FALSE, warning = FALSE,fig.fullwidth = TRUE, fig.width=12}
TF = "gene_sets/TF/TF_targets_dorothea_A-D_hs.gmt"
TFgs <- read.gmt(TF)
egmt_pos_combine=data.frame(matrix(nrow=0,ncol=10))
egmt_neg_combine=data.frame(matrix(nrow=0,ncol=10))
out_TFgs <- NULL
for (i in 1:nrow(community_comp) )
{
out_TFgs <- c(out_TFgs, knit_child('template_enrichment_in_TF_genesets.Rmd', quiet = TRUE))
}
```
`r paste(out_TFgs, collapse='\n')`
```{r communities_OR_TF_gene_sets_datatable, echo=FALSE, message=FALSE, warning = FALSE,fig.fullwidth = TRUE, fig.width=12}
datatable(egmt_pos_combine_reordered, rownames = FALSE, caption="\n\n\nEnrichment in TF gene sets for top-contributing genes (pos tail)")
datatable(egmt_neg_combine_reordered, rownames = FALSE, caption="\n\n\nEnrichment in TF gene sets for top-contributing genes (pos tail)")
```
For each community, a total of `r length(unique(TFgs$term))` gene sets are tested.
There are a total of `r nrow(egmt_pos_combine_reordered)` significant enrichments in positive tails and `r nrow(egmt_neg_combine_reordered)` significant enrichments in negative tails.
There are `r length(unique(egmt_pos_combine_reordered$ID))` unique TFs with significant enrichments in their targets in the top-contributing genes of the positive tails and `r length(unique(egmt_neg_combine_reordered$ID))` unique TFs with significant enrichments in their targers in the top-contributing genes of the negative tails.
# Annotation of the communities using user defined gene sets
Now we want to check if the detected communities are enriched in specific gene sets. Here we used the gene sets identified in the work of Aynaud et al. 2019, where single cells ES data were investiguated using ICA. Also, additional gene sets are used.
For that, we performed over representation analysis of the top-contributing genes of the positive and negative tails of each community using the defined gene sets.
```{r communities_OR_personnalized_gene_sets, echo=FALSE, message=FALSE, warning = FALSE,fig.fullwidth = TRUE, fig.width=12}
specific_gene_sets = "gene_sets/ics_and_signatures.gmt"
sgs <- read.gmt(specific_gene_sets)
egmt_pos_combine=data.frame(matrix(nrow=0,ncol=10))
egmt_neg_combine=data.frame(matrix(nrow=0,ncol=10))
out_sgs <- NULL
for (i in 1:nrow(community_comp) )
{
out_sgs <- c(out_sgs, knit_child('template_enrichment_in_specific_genesets.Rmd', quiet = TRUE))
}
```
`r paste(out_sgs, collapse='\n')`
```{r communities_OR_personnalized_gene_sets_datatable, echo=FALSE, message=FALSE, warning = FALSE,fig.fullwidth = TRUE, fig.width=12}
datatable(egmt_pos_combine_reordered, rownames = FALSE, caption="\n\n\nEnrichment in user defined gene sets for top-contributing genes (pos tail)")
datatable(egmt_neg_combine_reordered, rownames = FALSE, caption="\n\n\nEnrichment in user defined gene sets for for top-contributing genes (pos tail)")
egmt_pos_combine_reordered_ID=egmt_pos_combine_reordered[grep("^IC",egmt_pos_combine_reordered$ID),]
egmt_neg_combine_reordered_ID=egmt_neg_combine_reordered[grep("^IC",egmt_neg_combine_reordered$ID),]
unique_IC_pos = length( union( unique(egmt_pos_combine_reordered_ID$ID[grep("[+]",egmt_pos_combine_reordered_ID$ID)]), unique(egmt_neg_combine_reordered_ID$ID[grep("[+]",egmt_neg_combine_reordered_ID$ID)])) )
unique_IC_neg = length( union( unique(egmt_pos_combine_reordered_ID$ID[grep("-",egmt_pos_combine_reordered_ID$ID)]), unique(egmt_neg_combine_reordered_ID$ID[grep("-",egmt_neg_combine_reordered_ID$ID)]) ) )
```
For each community, a total of `r length(unique(sgs$term))` gene sets are tested.
There are a total of `r nrow(egmt_pos_combine_reordered)` significant enrichments in positive tails and `r nrow(egmt_neg_combine_reordered)` significant enrichment in negative tails.
`r length(grep("^IC",egmt_pos_combine_reordered$ID))` IC are found in positive tails and `r length(grep("^IC",egmt_neg_combine_reordered$ID))` IC are found in negative tails.
`r length(unique(egmt_pos_combine_reordered$ID[grep("^IC",egmt_pos_combine_reordered$ID)]))` IC (`r length(unique(egmt_pos_combine_reordered_ID$ID[grep("[+]",egmt_pos_combine_reordered_ID$ID)]))` IC+ and `r length(unique(egmt_pos_combine_reordered_ID$ID[grep("-",egmt_pos_combine_reordered_ID$ID)]))` IC-) are retrieved in positive tails and `r length(unique(egmt_neg_combine_reordered$ID[grep("^IC",egmt_neg_combine_reordered$ID)]))` IC (`r length(unique(egmt_neg_combine_reordered_ID$ID[grep("[+]",egmt_neg_combine_reordered_ID$ID)]))` IC+ and `r length(unique(egmt_neg_combine_reordered_ID$ID[grep("-",egmt_neg_combine_reordered_ID$ID)]))` IC-) are retrieved in negative tails.
`r unique_IC_pos` IC+ and `r unique_IC_neg` IC- are retrived, whatever the tail in this analysis.