-
Notifications
You must be signed in to change notification settings - Fork 0
/
ch12 dim reduction.Rmd
178 lines (147 loc) · 7.4 KB
/
ch12 dim reduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
title: "ch12 dim reduction"
author: "Bernie Mulvey"
date: "2023-05-15"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.height = 10,fig.width = 7,include = FALSE)
knitr::opts_chunk$set(fig.width=7,fig.height=10)
#### sets tab autocompletion for directories to begin from the directory containing the .Rproj session, instead of the script's directory if different
knitr::opts_knit$set(root.dir = here::here())
library(ggplot2)
# library(data.table)
library(Biostrings)
library(gridExtra)
require(colorout)
ColorOut()
options("styler.addins_style_transformer" = "biocthis::bioc_style()")
library(SpatialExperiment)
library(ggspavis)
library(scater) # addPerCellQC
library(nnSVG)
library(BiocParallel)
library(scran)
library(parallel)
# library(STexampleData)
theme_set(theme_bw()+theme(axis.text.x = element_text(size = 14), axis.title.x = element_text(size = 16), axis.text.y = element_text(size = 14), axis.title.y = element_text(size =16), plot.title = element_text(size = 20,hjust=0.5), strip.text = element_text(size=18), legend.text = element_text(size=10), legend.title = element_text(size=11,hjust=0.5)))
```
### PCA, quoth OSTA: "We use the computationally efficient implementation of PCA provided in the scater package (McCarthy et al. 2017). This implementation uses randomization, and therefore requires setting a random seed for reproducibility."
```{r}
splc2 <- readRDS("splc2_line201.RDS")
vargenes <- fread("Top 1310 ea SVGs and HVGs, union, and n union top SVGs.txt")
example.vg.vec <- vargenes$hvgs
## as coded in OSTA:
# splc2 <- runPCA(splc2,subset_row=example.vg.vec)
### note that this is then stored in a new part of the spe object under reducedDim(splc2,"PCA"). our UMAP will also go here, akin to the "counts" and "logcounts" entries in spe -> assays.
pcalist <- list()
vgs <- names(vargenes)
i <- 1
for (i in c(1:length(vgs))){
pcalist[[i]] <- scater::runPCA(splc2,subset_row=(unique(vargenes[get(vgs[i])!="",get(vgs[i])])),name=paste0(vgs[i],"_PCA"))
# names(pcalist)[i] <- vgs[i]
}
```
### UMAP on PCs: "We also run UMAP (McInnes, Healy, and Melville 2018) on the set of top 50 PCs and retain the top 2 UMAP components, which will be used for visualization purposes." (also in scater; returns the top 2 UMAP components by default)
### tSNE requires PCs, because it is moving points around based on their initial location as (PCx, PCy). UMAP doesn't EXPLICITLY require PCs, but is faster using them. Let's see how PCs->UMAP vs direct UMAP compare, too.
### scater::runUMAP uses logcounts by default if running on full data (exprs_values="logcounts"). BPPARAM can be specified and used to parallelize the PCA part of a from-raw-data UMAP task.
```{r}
### as written in OSTA, assuming we have one vector of genes we want to look at
# splc2 <- scater::runUMAP(splc2,dimred="PCA")
###
# we could also just lapply this one but too late
umaplist <- list()
i<-1
for (i in c(1:length(pcalist))){
set.seed(42)
umaplist[[i]] <- scater::runUMAP(pcalist[[i]],dimred=paste0(vgs[i],"_PCA"),name = paste0(vgs[i],"_UMAP"))
# per OSTA, easier to just rename the two UMAP columns to UMAP1 and UMAP2 for downstream plotting
colnames(reducedDim(umaplist[[i]],paste0(vgs[i],"_UMAP"))) <- paste0("UMAP",1:2)
}
```
## directly generating UMAP from logcounts with parallelization-once again, parallelization issues here and its only using 1 cpu, but this seems to be particular to scater for whatever reason. i can't figure it from the underlying code either.
## oh, on second look, it only does that if you're using the nearest neighbors setting, which is not a default part of the analysis. der. no wonder the following didn't work
bpparam <- MulticoreParam(workers = 8,manager.hostname = "localhost",fallback = F)
register(bpparam)
for (i in c(1:length(pcalist))){
umap.dir.list2[[i]] <- copy(splc2)
umap.dir.list2[[i]] <- scater::runUMAP(umap.dir.list2[[i]],name = paste0(vgs[i],"_dirUMAP"),subset_row=(unique(vargenes[get(vgs[i])!="",get(vgs[i])])),BPPARAM=bpparam)
colnames(reducedDim(umap.dir.list2[[i]],paste0(vgs[i],"_dirUMAP"))) <- paste0("UMAP",1:2)
}
# But THIS should
# umap is nondeterminstic, so set seed same as above for comparability (i.e., to see whether runUMAP on PCs is the same as runUMAP starting with the logcounts (which will be passed to runPCA first).
```{r}
bpparam <- MulticoreParam(workers = 4)
register(bpparam)
umap.dir.list <- bplapply(vgs,BPPARAM = bpparam,FUN = function(x){
set.seed(42)
tmp <- copy(splc2)
tmp <- scater::runUMAP(tmp,name = paste0(x,"_dirUMAP"),subset_row=(unique(vargenes[get(x)!="",get(x)])))
return(tmp)
})
# k THAT used 4 cores (one per list item). phew.
# save for loading into ch13
saveRDS(umaplist,"umaplist.RDS")
saveRDS(umap.dir.list,"umapdirlist.RDS")
saveRDS(pcalist,"pcalist.RDS")
```
So now we have a list of four SPE objects with PCA, 4 with PCA-derived UMAP, and 4 with UMAP run on the logcounts (which uses scater's runPCA function internally first anyways, so the latter two sets might look the same after all).So let's do some plotting of PCs and UMAPs.
this is done with ggspavis:: . and since theyre ggplots, we can put them all into a list and then GridExtra do.call gridarrange that shit for one big pdf to see it all at once.
```{r}
plts <- list()
i <- 1
for (i in c(1:12)){
if(i<=4){
curdim <- reducedDimNames(pcalist[[i]])
plts[[i]] <- plotReducedDim(pcalist[[i]],dimred = curdim)
rm(curdim)
}
else if(i>4&i<=8){
curdim <- reducedDimNames(umaplist[[i-4]])[2]
plts[[i]] <- plotReducedDim(umaplist[[i-4]],dimred=curdim)
rm(curdim)
}
else{
curdim <- reducedDimNames(umap.dir.list[[i-8]])
plts[[i]] <- plotReducedDim(umap.dir.list[[i-8]],dimred=curdim)
rm(curdim)
}
}
library(gridExtra)
pdf("plots/ch12_PCs_UMAPsfromPCs_UMAPdirects_4genesets_051623.pdf",width=17,height=14)
do.call("grid.arrange",c(plts,ncol=4))
dev.off()
```
^ from this, we can see that including SVGs (left, right columns) really helps distinguish a couple groups on the UMAP that don't diverge as far using only HVGs (3rd column). using some of both seemed to be the most discerning (note the group inside the closed sideways pie slice in row 3, col 2, and compare to the other sets in the row)
### wrapper to color by each potential explanatory variable
```{r}
qcplotter <- function(colvar){
plts <- list()
i <- 1
for (i in c(1:12)){
if(i<=4){
curdim <- reducedDimNames(pcalist[[i]])
plts[[i]] <- plotReducedDim(pcalist[[i]],dimred = curdim,colour_by = colvar)
rm(curdim)
}
else if(i>4&i<=8){
curdim <- reducedDimNames(umaplist[[i-4]])[2]
plts[[i]] <- plotReducedDim(umaplist[[i-4]],dimred=curdim,colour_by = colvar)
rm(curdim)
}
else{
curdim <- reducedDimNames(umap.dir.list[[i-8]])
plts[[i]] <- plotReducedDim(umap.dir.list[[i-8]],dimred=curdim,colour_by = colvar)
rm(curdim)
}
}
library(gridExtra)
pdf(paste0("plots/ch12_PCs_UMAPsfromPCs_UMAPdirects_4genesets_051623_colby_",colvar,".pdf"),width=17,height=14)
do.call("grid.arrange",c(plts,ncol=4))
dev.off()
}
# colData variables of interest: sample_id, round_id, sum, detected, subsets_mito_percent, cell_count, dtct1750, dtct200
cds <- c("sample_id","round_id","sum","detected","subsets_mito_percent","cell_count","dtct1750","dtct200")
lapply(cds,qcplotter)
```
## from these, we can see that the PCs primarily are separating samples from one another, which isn't unexpected since the LC is only a very small portion of each sample.