Updates made during the workshop

stemangiola · stemangiola · commit a3920dac2140 · 2025-05-30T10:45:34.000+09:30
diff --git a/vignettes/Session_1_sequencing_assays.Rmd b/vignettes/Session_1_sequencing_assays.Rmd
@@ -195,7 +195,7 @@ Maynard and Torres et al., doi: [10.1038/s41593-020-00787-0](https://www.ncbi.nl
 library(spatialLIBD)
 library(ExperimentHub)
 
-# To avoid error for SPE loading 
+# To avoid error for SPE loading
 # https://support.bioconductor.org/p/9161859/#9161863
 setClassUnion("ExpData", c("matrix", "SpatialExperiment"))
 
@@ -218,6 +218,10 @@ spatial_data
 
 From: <https://bookdown.org/sjcockell/ismb-tutorial-2023/practical-session-2.html>
 
+::: {.note}
+If `ExperimentHub` does not work, the `spatial_data` object from the previous code block can be downloaded from [Zenodo - 10.5281/zenodo.11233385](https://zenodo.org/records/11233385/files/tidySpatialWorkshop2024_spatial_data.rds).
+:::
+
 We shows metadata for each cell, helping understand the dataset's structure.
 
 ```{r}
@@ -487,6 +491,23 @@ The final step in data preprocessing involves removing all spots identified as l
 spatial_data = spatial_data[,!colData(spatial_data)$discard ]
 ```
 
+::: {.note}
+**Exercise 1.0.1**
+
+It is good practice to perform quality control independently for each sample and for different cell types. This is because samples and cell types (or tissue regions) can have distinct baseline distributions of quality control factors (e.g. mitochondrial transcription).
+
+1) Let's try to plot `subsets_mito_percent` grouping by sample, using `ggplot2`.
+2) Also, let's try to add tissue regions (present in the `colData` as already described) as colors, using `ggplot2`
+:::
+
+::: {.note}
+**Exercise 1.0.2**
+
+Thresholding is an easy-to-understand approach, but the chosen threshold is often arbitrary. A better strategy is outlier detection. With this strategy, the baseline distribution of a QC factor (e.g. mitochondrial transcription) is used to detect anomalous spots/cells. Read the documentation of `scater::isOutlier`, and use it to label outlier spots for mitochondrial transcription.
+
+Then, note which method is more stringent: our thresholding or outlier detection.
+:::
+
 ### 6. Dimensionality reduction
 
 Dimensionality reduction is essential in spatial transcriptomics due to the high-dimensional nature of the data, which includes vast gene expression profiles across various spatial locations. Techniques such as PCA (Principal Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) are particularly valuable. PCA helps to reduce noise and highlight the most significant variance in the data, making it simpler to uncover underlying patterns and correlations. UMAP, ofen calculated from principal components (and not directly from features) preserves both global and local data structures, enabling more nuanced visualisations of complex cellular landscapes. Together, these methods facilitate a deeper understanding of spatial gene expression, helping to reveal biological insights such as cellular heterogeneity and tissue structure, which are crucial for both basic biological research and clinical applications.
@@ -749,7 +770,7 @@ cluster_metadata =
   colData(spatial_data_list$sample_151673)[, c("clust_M0_lam0.2_k50_res0.7", "clust_M0_lam0.2_k50_res0.7_smooth")]
 
  
-knitr::kable(head(cluster_metadata), format = "html")
+knitr::kable(head(cluster_metadata, 10), format = "html")
 ```
 
 Using cluster comparison metrics like the adjusted Rand index (ARI) we evaluate the performance of our clustering approach. This statistical analysis helps validate the clustering results against known labels or pathologies.
@@ -854,23 +875,20 @@ table(brain_reference$cell_types)
 
 ```
 
-These are the number of samples we have for each of the three data sets.
+This is the number of samples we have.
 
 ```{r}
 
 table(brain_reference$sample)
 ```
 
 
-Now, we identify the variable genes within each dataset, to not capture technical effects, and identify the union of variable genes for further analysis.
+Now, we identify the highly variable genes, blocking by sample so that we do not capture technical effects, and collect the unique set of variable genes for further analysis.
 
 ```{r, warning=FALSE}
 genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(brain_reference))
 
-# Convert to list
-brain_reference_list <- lapply(unique(brain_reference$dataset_id), function(x) brain_reference[, brain_reference$dataset_id == x])
-
-dec = scran::modelGeneVar(brain_reference, subset.row = genes, block = brain_reference$sample_id)
+dec = scran::modelGeneVar(brain_reference, subset.row = genes, block = brain_reference$sample)
 hvg_CAQ = scran::getTopHVGs(dec, n = 1000)
             
 hvg_CAQ = unique( unlist(hvg_CAQ))
@@ -990,6 +1008,7 @@ No, let's look at the correlation matrices to see which cell type are most often
 
 plotCorrelationMatrix(res$mat)
 ```
+
 ```{r}
 mat_df = as.data.frame(res$mat)
 ```
diff --git a/vignettes/Session_2_Tidy_spatial_analyses.Rmd b/vignettes/Session_2_Tidy_spatial_analyses.Rmd
@@ -111,8 +111,9 @@ rownames(spatialCoords(spatial_data)) = colnames(spatial_data) # Bug?
 # Display the object
 spatial_data
 ```
+
 ::: {.note}
-If `ExperimentHub` should not work. The `spatial_data` object from the previous code block can be downloaded from [Zenodo - 10.5281/zenodo.11233385](https://zenodo.org/records/11233385/files/tidySpatialWorkshop_spatial_data.rds?download=1)
+If `ExperimentHub` does not work, the `spatial_data` object from the previous code block can be downloaded from [Zenodo - 10.5281/zenodo.11233385](https://zenodo.org/records/11233385/files/tidySpatialWorkshop2024_spatial_data.rds).
 :::
 
 ## Working with tidySpatialExperiment
@@ -180,7 +181,7 @@ spatial_data |> select(.cell, sample_id, in_tissue, spatialLIBD)
 ```
 
 ::: {.note}
-Note that some columns are always displayed no matter whet. These column include special slots in the objects such as reduced dimensions, spatial coordinates (mandatory for `SpatialExperiment`), and sample identifier (mandatory for `SpatialExperiment`). 
+Note that some columns are always displayed no matter what. These columns include special slots in the object, such as reduced dimensions, spatial coordinates (mandatory for `SpatialExperiment`), and the sample identifier (mandatory for `SpatialExperiment`).
 :::
 
 Although the select operation can be used as a display tool, to explore our object, it updates the `SpatialExperiment` metadata, subsetting the desired columns.
@@ -277,7 +278,7 @@ spatial_data |>
 We can update the underlying `SpatialExperiment` object, for future analyses. And confirm that the `SpatialExperiment` metadata has been mutated.
 
 ```{r message=FALSE}
-spatial_data = 
+spatial_data <- 
   spatial_data |>
   mutate(spatialLIBD_lower = tolower(spatialLIBD))
 
@@ -313,11 +314,12 @@ Extract specific identifiers from complex data paths, simplifying the dataset by
 ```{r}
 # Create column for sample
 spatial_data <- spatial_data |>
+  
   # Extract sample ID from file path and display the updated data
   tidyr::extract(file_path, "sample_id_from_file_path", "\\.\\./data/single_cell/([0-9]+)/outs/raw_feature_bc_matrix/", remove = FALSE)
 
 # Take a look
-spatial_data |> select(.cell, sample_id_from_file_path, everything())
+spatial_data |> select(.cell, sample_id_from_file_path, file_path, everything())
 ```
 
 #### Unite
@@ -329,7 +331,7 @@ We could use tidyverse `unite` to combine columns, for example to create a new c
 spatial_data <- spatial_data |> unite("sample_subject", sample_id, subject, remove = FALSE)
 
 # Take a look
-spatial_data |> select(.cell, sample_id, sample_subject, subject)
+spatial_data |> select(.cell, sample_id, subject, sample_subject)
 ```
 
 ### 3. Advanced filtering/gating and pseudobulk
@@ -350,9 +352,19 @@ spatial_data =
   # Gate based on tissue morphology
   tidySpatialExperiment::gate(alpha = 0.1, colour = "spatialLIBD") 
 
+spatial_data |> select(.cell, .gated)
+```
+
+```{r, eval=FALSE}
+tidygate_env$gates |> saveRDS("<PATH>")
+```
+
+```{r}
 spatial_data_gated = tidygate_env$gates
 ```
 
+
+
 You can reload a pre-made gate for reproducibility
 
 ```{r}
@@ -433,7 +445,7 @@ Join the feature to the metadata
 ```{r}
 spatial_data = 
   spatial_data |> 
-  join_features("ENSG00000131095", shape="wide")
+  join_features("ENSG00000131095", shape="wide", assay = "logcounts")
 
 spatial_data |> 
   select(.cell, ENSG00000131095)
@@ -443,13 +455,15 @@ spatial_data |>
 
 ::: {.note}
 **Exercise 2.2**
+
 Join the endothelial marker PECAM1 (CD31, look for ENSEMBL ID), and plot in space the pixel that are in the 0.75 percentile of EPCAM1 expression. Are the PECAM1-positive pixels (endothelial?) spatially clustered?
 
 - Get the ENSEMBL ID
 - Join the feature to the tidy data abstraction
 - Calculate the 0.75 quantile across all pixels `mutate()`
 - Label the cells with high PECAM1
 - Plot the slide colouring for the new label 
+
 :::
 
 
@@ -484,7 +498,7 @@ We calculate summary statistics of a subset of data
 
 ```{r}
 spatial_data |> 
-filter(Cluster==1) |> 
+  filter(Cluster==1) |> 
   count(sample_id) |> 
   arrange(desc(n))
 
@@ -589,9 +603,7 @@ spatial_data <-
   addPerCellQC(subsets = list(mito = is_gene_mitochondrial)) |> 
   
   ## Add threshold in colData
-  mutate(
-    qc_mitochondrial_transcription = subsets_mito_percent > 30
-  )
+  mutate( qc_mitochondrial_transcription = subsets_mito_percent > 30 )
 
 spatial_data |> select(.cell, qc_mitochondrial_transcription)
 
@@ -623,6 +635,8 @@ marker_genes =
           }
   ) 
 
+marker_genes = cbind(marker_genes)
+
 head(unique(unlist(marker_genes)))
 
 ```
@@ -726,6 +740,15 @@ spatial_data_filtered =
 **Maintainability:** Fewer and self-explanatory lines of code and no need for intermediate steps make the code easier to maintain and modify, especially when conditions change or additional filters are needed.
 
 
+::: {.note}
+**Exercise 2.2.1**
+
+In Session 1 we showed that a good strategy for QC filtering is outlier detection. With this strategy, the baseline distribution of a QC factor (e.g. mitochondrial transcription) is used to detect anomalous spots/cells. Read the documentation of `scater::isOutlier`, and use it WITH `tidyomics`/`tidyverse` to label outlier spots for mitochondrial transcription.
+
+Then, solely using `tidyomics`/`tidyverse`, note which method is more stringent: our thresholding or outlier detection.
+
+:::
+
 ### 7. Visualisation
 
 Here, we will show how to use ad-hoc spatial visualisation, as well as `ggplot` to explore spatial data we will show how `tidySpatialExperiment` allowed to alternate between tidyverse visualisation, and any visualisation compatible with `SpatialExperiment`. 
@@ -780,9 +803,10 @@ We provide another example of how the use of tidy. Spatial experiment makes cust
 spatial_data_filtered |> 
   ggplot(aes(subsets_mito_percent, sum_gene)) + 
   geom_point(aes(color = spatialLIBD), size=0.2) +  
-  stat_ellipse(aes(group = spatialLIBD), alpha = 0.3) +
+  stat_ellipse(aes(group = spatialLIBD, color = spatialLIBD), alpha = 0.3) +
   scale_color_manual(values = libd_layer_colors |>
   str_remove("ayer")) +
+  geom_smooth(aes(group = spatialLIBD), method="lm") +
   scale_y_log10() +
   theme_bw()
 
@@ -828,7 +852,7 @@ We assume that the cells we filtered as non-alive or damaged, characterised by b
 
 Use `tidyomic`/`tidyverse` tools to label dead cells and perform differential expression within each region. Some of the comments you can use are: `mutate`, `nest`, `map`, `aggregate_cells`, `tidybulk:::test_differential_abundance`, 
 
-A hist:
+A hint:
 
 - spatial_data |> 
   mutate(
diff --git a/vignettes/Solutions.Rmd b/vignettes/Solutions.Rmd
@@ -37,6 +37,31 @@ spot_counts <-
 spot_counts
 ```
 
+::: {.note}
+**Exercise 1.0.1**
+:::
+
+```{r}
+colData(spatial_data) |> 
+  ggplot(aes(subsets_mito_percent)) + 
+  geom_density(aes(linetype = sample_id)) 
+
+colData(spatial_data) |> 
+  ggplot(aes(subsets_mito_percent)) + 
+  geom_density(aes(color = spatialLIBD, linetype = sample_id)) 
+
+
+```
+
+::: {.note}
+**Exercise 1.0.2**
+:::
+
+```{r}
+scater::isOutlier(colData(spatial_data)$subsets_mito_percent, type="higher") |> table()
+colData(spatial_data)$qc_mitochondrial_transcription |> table()
+```
+
 ::: {.note}
 **Exercise 1.1**
 :::
@@ -232,10 +257,10 @@ spatial_data |>
   filter(in_tissue, sample_id=="151673") |> 
   
   # Gate based on tissue morphology
-  tidySpatialExperiment::gate_spatial(alpha = 0.1) |> 
+  tidySpatialExperiment::gate(alpha = 0.1) |> 
 
   # Plot
-  scater::plotUMAP(colour_by = ".gate")
+  scater::plotUMAP(colour_by = ".gated")
 ```
 
 
@@ -252,22 +277,36 @@ rowData(spatial_data) |>
 gene = "ENSG00000261371"
 
 spatial_data |> 
-
+  
   # Join the feature
-  join_features(gene, shape="wide") |> 
-
+  join_features(gene, shape="wide", assay = "logcounts") |> 
+  
   # Calculate the quantile
-  mutate(my_quantile = quantile(ENSG00000261371, 0.75)) |> 
-
+  mutate(my_quantile = quantile(ENSG00000261371, 0.99)) |> 
+  # mutate(my_quantile = ENSG00000261371 > 0) |> # A possibility
+  
   # Label the pixels
-  mutate(PECAM1_positive = ENSG00000261371 > my_quantile) |> 
-
+  mutate(PECAM1_positive = ENSG00000261371 >= my_quantile) |> 
+  
   # Plot
   ggspavis::plotSpots(annotate = "PECAM1_positive") +
   facet_wrap(~sample_id) 
 
 ```
 
+::: {.note}
+**Exercise 2.2.1**
+:::
+
+```{r}
+spatial_data |> 
+  mutate(qc_mitochondrial_transcription = scater::isOutlier(subsets_mito_percent, type="higher")) |> 
+  count(qc_mitochondrial_transcription) 
+
+spatial_data |> 
+  count(qc_mitochondrial_transcription) 
+
+```
 
 ::: {.note}
 **Excercise 2.3**
@@ -532,4 +571,5 @@ annotate = "region"
     guides(color = "none") +
     labs(title = "ground truth")
 
-```
+```
+