added kontextual + classifyR [skip ci]

SydneyBioX · Oct 28, 2024 · 5cfc4ed · 5cfc4ed
1 parent 27664de
commit 5cfc4ed
Show file tree

Hide file tree

Showing 2 changed files with 369 additions and 5 deletions.
diff --git a/inst/extdata/kontext.rda b/inst/extdata/kontext.rda
diff --git a/vignettes/scdneySpatial.Rmd b/vignettes/scdneySpatial.Rmd
@@ -13,7 +13,6 @@ editor_options:
     wrap: 72
 ---
 
-
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   #collapse = TRUE,
@@ -35,11 +34,10 @@ knitr::opts_chunk$set(
 }
 </style>
 ```
-
 **Presenting Authors**
 
-Farhan Ameen$^{1,2,3}$, Alex Qin$^{1,2,3}$, Shreya Rao$^{1,2,3}$,
-Ellis Patrick$^{1,2,3}$.
+Farhan Ameen$^{1,2,3}$, Alex Qin$^{1,2,3}$, Shreya Rao$^{1,2,3}$, Ellis
+Patrick$^{1,2,3}$.
 
 $^1$ Westmead Institute for Medical Research, University of Sydney,
 Australia\
@@ -48,4 +46,370 @@ Australia\
 $^3$ School of Mathematics and Statistics, University of Sydney,
 Australia
 
-<br/> Contact: ellis.patrick\@sydney.edu.au
+<br/> Contact: ellis.patrick\@sydney.edu.au
+
+## Overview
+
+Understanding the interplay between different types of cells and their
+immediate environment is critical for understanding the mechanisms of
+cells themselves and their function in the context of human diseases.
+Recent advances in high dimensional in situ cytometry technologies have
+fundamentally revolutionized our ability to observe these complex
+cellular relationships providing an unprecedented characterisation of
+cellular heterogeneity in a tissue environment.
+
+### Description
+
+In this workshop we will introduce some of the key analytical concepts
+needed to analyse data from high dimensional spatial omics technologies
+such as, PhenoCycler, IMC, Xenium and MERFISH. We will show how
+functionality from our Bioconductor packages simpleSeg, FuseSOM,
+scClassify, scHot, spicyR, listClust, statial, scFeatures and ClassifyR
+can be used to address various biological hypotheses. By the end of this
+workshop attendees will be able to implement and assess some of the key
+steps of a spatial analysis pipeline including cell segmentation,
+feature normalisation, cell type identification, microenvironment and
+cell-state characterisation, spatial hypothesis testing and patient
+classification. Understanding these key steps will provide attendees
+with the core skills needed to interrogate the comprehensive spatial
+information generated by these exciting new technologies.
+
+### Pre-requisites
+
+It is expected that students will have:
+
+-   basic knowledge of R syntax,
+-   this workshop will not provide an in-depth description of
+    cell-resolution spatial omics technologies.
+
+### *R* / *Bioconductor* packages used
+
+Several single cell R packages will be used from the scdney package, for
+more information visit: <https://sydneybiox.github.io/scdney/>.
+
+### Learning objectives
+
+-   Understand and visualise spatial omics datasets.
+-   Identify key biological questions that can be addressed with these
+    technologies and spatial analysis.
+-   Understand the key analytical steps involved in spatial omics
+    analysis, and perform these steps using R.
+-   Evaluate the performance of data normalisation and cell
+    segmentation.
+-   Understand and generate individual feature representations from
+    spatial omics data.
+-   Develop appreciation on how to assess performance of classification
+    models.
+-   Perform disease outcome prediction using the feature representation
+    and robust classification framework.
+
+## Workshop
+
+## Setup
+
+### Installation
+
+```{r install, warning=FALSE, message=FALSE, eval = FALSE}
+if (!requireNamespace("BiocManager", quietly = TRUE)) {
+    install.packages("BiocManager")
+}
+
+BiocManager::install(c( "simpleSeg", "cytomapper", "scClassify", "scHOT", "FuseSOM","glmnet", "spicyR", "lisaClust","Statial", "scFeatures", "ClassifyR", "tidyverse", "scater", "SingleCellExperiment", "STexampleData", "SpatialDatasets", "tidySingleCellExperiment", "scuttle", "batchelor"))
+```
+
+### Load packages
+
+```{r library, warning=FALSE, message=FALSE}
+
+# packages from scdney
+library(SpatialDatasets)
+library(simpleSeg)
+library(FuseSOM)
+library(lisaClust)
+library(spicyR)
+#library(Statial)
+devtools::load_all("~/Desktop/UNI/Phd/Projects/Statial")
+devtools::load_all("~/Desktop/UNI/Phd/Projects/spicyR")
+library(ClassifyR)
+
+# Other required packages
+library(tidySingleCellExperiment)
+library(tidyverse)
+library(SingleCellExperiment)
+
+
+theme_set(theme_classic())
+
+nCores <- 8  # Feel free to parallelise things if you have the cores to spare.
+BPPARAM <- simpleSeg:::generateBPParam(nCores)
+options("restore_SingleCellExperiment_show" = TRUE)
+
+```
+
+### Download datasets
+
+Please download these datasets before you start the workshop
+
+```{r data}
+#Download data now
+#SpatialDatasets::Ferguson_Images()
+spe = SpatialDatasets::spe_Ferguson_2022()
+
+spe$x = spatialCoords(spe)[, "x"]
+spe$y = spatialCoords(spe)[, "y"]
+```
+
+## SimpleSeg: Segmentation and normalisation
+
+### ALEX CHANGE LATER: Normalise and cluster
+
+```{r normalise}
+
+useMarkers <- rownames(spe)[!rownames(spe) %in% c("DNA1", "DNA2", "HH3")]
+
+spe <- normalizeCells(spe,
+                        markers = useMarkers,
+                        transformation = NULL,
+                        method = c("trim99", "mean", "PC1"),
+                        assayIn = "counts",
+                        cores = nCores
+)
+
+# Usually we specify a subset of the original markers which are informative to separating out distinct cell types for the UMAP and clustering.
+ct_markers <- c("podoplanin", "CD13", "CD31",
+                "panCK", "CD3", "CD4", "CD8a",
+                "CD20", "CD68", "CD16", "CD14", "HLADR", "CD66a")
+
+
+# Set seed.
+set.seed(51773)
+
+
+
+# Generate SOM and cluster cells into 10 groups
+spe <- runFuseSOM(
+  spe,
+  markers = ct_markers,
+  assay = "norm",
+  numClusters = 10
+)
+
+
+spe <- spe |>
+  mutate(cellType = case_when(
+    clusters == "cluster_1" ~ "GC",
+    clusters == "cluster_2" ~ "MC",
+    clusters == "cluster_3" ~ "SC",
+    clusters == "cluster_4" ~ "EP",
+    clusters == "cluster_5" ~ "SC",
+    clusters == "cluster_6" ~ "TC_CD4",
+    clusters == "cluster_7" ~ "BC",
+    clusters == "cluster_8" ~ "EC",
+    clusters == "cluster_9" ~ "TC_CD8",
+    clusters == "cluster_10" ~ "DC"
+  ))
+```
+
+## Statial
+
+[Statial
+Package](https://www.bioconductor.org/packages/release/bioc/html/Statial.html)
+
+### SpatioMark
+
+```{r}
+# Preparing outcome vector
+outcome <- spe$group[!duplicated(spe$imageID)]
+names(outcome) <- spe$imageID[!duplicated(spe$imageID)]
+```
+
+### Kontextual
+
+[Pre-print](https://www.biorxiv.org/content/10.1101/2024.09.03.611109v1)
+
+```{r hierarchy}
+# Cluster cell type hierarchy
+hierarchy <- treekoR::getClusterTree(t(assay(spe, "norm")),
+                                     clusters = spe$cellType,
+                                     hierarchy_method = "hopach")
+
+
+parentList = Statial::getParentPhylo(hierarchy)
+
+# Visualise tree
+hierarchy$clust_tree |> 
+  plot()
+
+# Add parent 4 and 5
+parentList$parent_4 = c(parentList$parent_2, parentList$parent_3)
+parentList$parent_5 = c(parentList$parent_1, "EC")
+
+all = unique(spe$cellType)
+
+# Create data frame with all pairwise combinations of cells
+treeDf = Statial::parentCombinations(all, parentList = parentList)
+
+```
+
+```{r kontextual, eval = FALSE}
+kontext = Kontextual(
+  cells = spe,
+  cellType = "cellType",
+  r = 50,
+  parentDf = treeDf,
+  cores = nCores
+)
+```
+
+To save time we have saved the output of the results above.
+
+```{r loadKontextual}
+load(system.file("extdata", "kontextual.rda", package = "ScdneySpatial"))
+```
+
+```{r}
+imageChoose = "H3"
+
+spe |>
+  colData() |>
+  as.data.frame() |>
+  filter(imageID == imageChoose) |>
+  filter(cellType %in% c("SC", "TC_CD8", parentList$parent_4)) |>
+  mutate(cellType = case_when(
+    cellType %in% parentList$parent_4[parentList$parent_4 != "TC_CD8"] ~ "Immune",
+    TRUE ~ cellType
+  )) |> 
+  mutate(cellType = factor(cellType, levels = c("TC_CD8",  "SC", "Immune"))) |> 
+  arrange(desc(cellType)) |> 
+  ggplot(aes(x = x, y = y, color = cellType)) +
+  geom_point(size = 1) +
+  scale_colour_manual(values = c("SC" = "#404040", "TC_CD8" = "#E25758", "Immune" = "#D7D8D8")) +
+  guides(colour = guide_legend(title = "Cell types", override.aes = list(size = 3)))
+
+```
+
+We can use the `spicyR::spicy` to test for associations between the `Kontextual` values and progression status. To do so we use the `alternateResult` arguement.
+
+```{r kontextOutcome, fig.height = 5, fig.width = 10}
+# Converting Kontextual result into data matrix
+kontextMat <- prepMatrix(kontext)
+
+# Replace NAs with 0
+kontextMat[is.na(kontextMat)] <- 0
+
+# Test association with outcome
+kontextOutcome = spicyR::spicy(
+  cells = spe,
+  alternateResult = kontextMat,
+  condition = "group",
+  BPPARAM = BPPARAM
+)
+
+# Plot results
+signifPlot(kontextOutcome)
+```
+
+# ClassifyR: Patient classification
+
+[ClassifyR
+Package](https://www.bioconductor.org/packages/release/bioc/html/ClassifyR.html)
+
+[Journal
+article](https://academic.oup.com/bioinformatics/article/31/11/1851/2365648?login=false)
+
+
+Our ClassifyR package formalises a convenient framework for evaluating classification in R. We provide functionality to easily include four key modelling stages; Data transformation, feature selection, classifier training and prediction; into a cross-validation loop. Here we use the `crossValidate` function to perform 50 repeats of 5-fold cross-validation to evaluate the performance of a random forest applied to five quantifications of our IMC data; 1) Cell type proportions 2) Cell type localisation from `spicyR` 3) Region proportions from `lisaClust` 4) Cell type localisation in reference to a parent cell type from `Kontextual` 5) Cell changes in response to proximal changes from `SpatioMark`
+
+```{r features}
+# Create list to store data.frames
+data <- list()
+
+# Add proportions of each cell type in each image
+data[["Proportions"]] <- getProp(spe, "cellType")
+
+# Add pair-wise associations
+# spicyMat <- bind(spicyTest)
+# spicyMat[is.na(spicyMat)] <- 0
+# spicyMat <- spicyMat |>
+#   select(!condition) |>
+#   tibble::column_to_rownames("imageID")
+# 
+# data[["SpicyR"]] <- spicyMat
+
+# Add proportions of each region in each image to the list of dataframes.
+#data[["LisaClust"]] <- getProp(cells, "region")
+
+
+# Add SpatioMark features
+#data[["SpatioMark"]] <- distMat
+
+# Add Kontextual features
+data[["Kontextual"]] <- kontextMat
+```
+
+
+```{r classification}
+# Set seed
+set.seed(51773)
+
+# Perform 5-fold cross-validation with 50 repeats.
+cv <- crossValidate(
+  measurements = data,
+  outcome = outcome,
+  selectionMethod = "t-test",
+  classifier = "LASSOGLM",
+  nFolds = 5,
+  nFeatures = 50,
+  nRepeats = 50,
+  nCores = nCores
+)
+```
+
+## Visualise cross-validated prediction performance
+
+Here we use the `performancePlot` function to assess the AUC from each repeat of the 5-fold cross-validation.
+
+```{r performancePlot}
+performancePlot(
+  cv,
+  metric = "AUC",
+  characteristicsList = list(x = "Assay Name"),
+  orderingList = list("Assay Name" = c("Proportions", "SpicyR", "LisaClust", "Kontextual", "SpatioMark"))
+)
+```
+
+
+::: question
+**Questions**
+
+1. Run the following chunk of code, what is this plot telling us?
+2. Do different methods perform differently on different samples?
+3. Knowing this, what might we do to improve our model?
+:::
+
+
+
+```{r sampleMetrics}
+samplesMetricMap(cv) |> 
+  plot()
+
+```
+
+
+```{r multiViewClassification}
+cv <- crossValidate(
+  measurements = data,
+  outcome = outcome,
+  classifier = "randomForest",
+  nFolds = 5,
+  nRepeats = 50,
+  multiViewMethod = "merge",
+  nCores = nCores
+)
+
+performancePlot(
+  cv,
+  metric = "AUC",
+  characteristicsList = list(x = "Assay Name"),
+)
+```