ch12 dim reduction.Rmd

---
title: "ch12 dim reduction"
author: "Bernie Mulvey"
date: "2023-05-15"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.height = 10,fig.width = 7,include = FALSE)
knitr::opts_chunk$set(fig.width=7,fig.height=10)
#### sets tab autocompletion for directories to begin from the directory containing the .Rproj session, instead of the script's directory if different
knitr::opts_knit$set(root.dir = here::here())

library(ggplot2)
# library(data.table)
library(Biostrings)
library(gridExtra)
require(colorout)
ColorOut()
options("styler.addins_style_transformer" = "biocthis::bioc_style()")
library(SpatialExperiment)
library(ggspavis)
library(scater) # addPerCellQC
library(nnSVG)
library(BiocParallel)
library(scran)
library(parallel)
# library(STexampleData)

theme_set(theme_bw()+theme(axis.text.x = element_text(size = 14), axis.title.x = element_text(size = 16), axis.text.y = element_text(size = 14), axis.title.y = element_text(size =16), plot.title = element_text(size = 20,hjust=0.5), strip.text = element_text(size=18), legend.text = element_text(size=10), legend.title = element_text(size=11,hjust=0.5)))
```


### PCA, quoth OSTA: "We use the computationally efficient implementation of PCA provided in the scater package (McCarthy et al. 2017). This implementation uses randomization, and therefore requires setting a random seed for reproducibility."

```{r}
splc2 <- readRDS("splc2_line201.RDS")
vargenes <- fread("Top 1310 ea SVGs and HVGs, union, and n union top SVGs.txt")
example.vg.vec <- vargenes$hvgs

## as coded in OSTA:
# splc2 <- runPCA(splc2,subset_row=example.vg.vec)

### note that this is then stored in a new part of the spe object under reducedDim(splc2,"PCA"). our UMAP will also go here, akin to the "counts" and "logcounts" entries in spe -> assays.


pcalist <- list()
vgs <- names(vargenes)
i <- 1
for (i in c(1:length(vgs))){
  pcalist[[i]] <- scater::runPCA(splc2,subset_row=(unique(vargenes[get(vgs[i])!="",get(vgs[i])])),name=paste0(vgs[i],"_PCA"))
  # names(pcalist)[i] <- vgs[i]
}
```

### UMAP on PCs: "We also run UMAP (McInnes, Healy, and Melville 2018) on the set of top 50 PCs and retain the top 2 UMAP components, which will be used for visualization purposes." (also in scater; returns the top 2 UMAP components by default)

### tSNE requires PCs, because it is moving points around based on their initial location as (PCx, PCy). UMAP doesn't EXPLICITLY require PCs, but is faster using them. Let's see how PCs->UMAP vs direct UMAP compare, too.


### scater::runUMAP uses logcounts by default if running on full data (exprs_values="logcounts"). BPPARAM can be specified and used to parallelize the PCA part of a from-raw-data UMAP task.

```{r}
### as written in OSTA, assuming we have one vector of genes we want to look at
# splc2 <- scater::runUMAP(splc2,dimred="PCA")
### 

# we could also just lapply this one but too late
umaplist <- list()
i<-1
for (i in c(1:length(pcalist))){
  set.seed(42)
  umaplist[[i]] <- scater::runUMAP(pcalist[[i]],dimred=paste0(vgs[i],"_PCA"),name = paste0(vgs[i],"_UMAP"))
  # per OSTA, easier to just rename the two UMAP columns to UMAP1 and UMAP2 for downstream plotting
  colnames(reducedDim(umaplist[[i]],paste0(vgs[i],"_UMAP"))) <- paste0("UMAP",1:2)
}
```


## directly generating UMAP from logcounts with parallelization-once again, parallelization issues here and its only using 1 cpu, but this seems to be particular to scater for whatever reason. i can't figure it from the underlying code either.

## oh, on second look, it only does that if you're using the nearest neighbors setting, which is not a default part of the analysis. der. no wonder the following didn't work

bpparam <- MulticoreParam(workers = 8,manager.hostname = "localhost",fallback = F)
register(bpparam)

for (i in c(1:length(pcalist))){
   umap.dir.list2[[i]] <- copy(splc2)
   umap.dir.list2[[i]] <- scater::runUMAP(umap.dir.list2[[i]],name = paste0(vgs[i],"_dirUMAP"),subset_row=(unique(vargenes[get(vgs[i])!="",get(vgs[i])])),BPPARAM=bpparam)
   colnames(reducedDim(umap.dir.list2[[i]],paste0(vgs[i],"_dirUMAP"))) <- paste0("UMAP",1:2)
}

# But THIS should
# umap is nondeterminstic, so set seed same as above for comparability (i.e., to see whether runUMAP on PCs is the same as runUMAP starting with the logcounts (which will be passed to runPCA first).
```{r}
bpparam <- MulticoreParam(workers = 4)
register(bpparam)

umap.dir.list <- bplapply(vgs,BPPARAM = bpparam,FUN = function(x){
  set.seed(42)
  tmp <- copy(splc2)
  tmp <- scater::runUMAP(tmp,name = paste0(x,"_dirUMAP"),subset_row=(unique(vargenes[get(x)!="",get(x)])))
  return(tmp)
})
# k THAT used 4 cores (one per list item). phew.

# save for loading into ch13
saveRDS(umaplist,"umaplist.RDS")
saveRDS(umap.dir.list,"umapdirlist.RDS")
saveRDS(pcalist,"pcalist.RDS")
```


So now we have a list of four SPE objects with PCA, 4 with PCA-derived UMAP, and 4 with UMAP run on the logcounts (which uses scater's runPCA function internally first anyways, so the latter two sets might look the same after all).So let's do some plotting of PCs and UMAPs.

this is done with ggspavis:: . and since theyre ggplots, we can put them all into a list and then GridExtra do.call gridarrange that shit for one big pdf to see it all at once.
```{r}
plts <- list()
i <- 1
for (i in c(1:12)){
  if(i<=4){
    curdim <- reducedDimNames(pcalist[[i]])
    plts[[i]] <- plotReducedDim(pcalist[[i]],dimred = curdim)
    rm(curdim)
  }
  else if(i>4&i<=8){
    curdim <- reducedDimNames(umaplist[[i-4]])[2]
    plts[[i]] <- plotReducedDim(umaplist[[i-4]],dimred=curdim)
    rm(curdim)
  }
  else{
    curdim <- reducedDimNames(umap.dir.list[[i-8]])
    plts[[i]] <- plotReducedDim(umap.dir.list[[i-8]],dimred=curdim)
    rm(curdim)
  }
}
library(gridExtra)
pdf("plots/ch12_PCs_UMAPsfromPCs_UMAPdirects_4genesets_051623.pdf",width=17,height=14)
do.call("grid.arrange",c(plts,ncol=4))
dev.off()
```

^ from this, we can see that including SVGs (left, right columns) really helps distinguish a couple groups on the UMAP that don't diverge as far using only HVGs (3rd column). using some of both seemed to be the most discerning (note the group inside the closed sideways pie slice in row 3, col 2, and compare to the other sets in the row)

### wrapper to color by each potential explanatory variable
```{r}
qcplotter <- function(colvar){
  plts <- list()
  i <- 1
  for (i in c(1:12)){
    if(i<=4){
      curdim <- reducedDimNames(pcalist[[i]])
      plts[[i]] <- plotReducedDim(pcalist[[i]],dimred = curdim,colour_by = colvar)
      rm(curdim)
    }
    else if(i>4&i<=8){
      curdim <- reducedDimNames(umaplist[[i-4]])[2]
      plts[[i]] <- plotReducedDim(umaplist[[i-4]],dimred=curdim,colour_by = colvar)
      rm(curdim)
    }
    else{
      curdim <- reducedDimNames(umap.dir.list[[i-8]])
      plts[[i]] <- plotReducedDim(umap.dir.list[[i-8]],dimred=curdim,colour_by = colvar)
      rm(curdim)
    }
  }
  library(gridExtra)
  pdf(paste0("plots/ch12_PCs_UMAPsfromPCs_UMAPdirects_4genesets_051623_colby_",colvar,".pdf"),width=17,height=14)
  do.call("grid.arrange",c(plts,ncol=4))
  dev.off()
}

# colData variables of interest: sample_id, round_id, sum, detected, subsets_mito_percent, cell_count, dtct1750, dtct200
cds <- c("sample_id","round_id","sum","detected","subsets_mito_percent","cell_count","dtct1750","dtct200")
lapply(cds,qcplotter)
```

## from these, we can see that the PCs primarily are separating samples from one another, which isn't unexpected since the LC is only a very small portion of each sample.