In this workshop, we will learn how clustSIGNAL, a spatial cell-type clustering method, works and how to run it in R. We will also discuss the parameters available for running the method, how to assess the relevance of the clusters, and the additional information generated by clustSIGNAL that could be used for other analyses.
The clustSIGNAL R package can be downloaded and installed from GitHub (SydneyBioX/clustSIGNAL).
```r
# install.packages("devtools")
devtools::install_github("SydneyBioX/clustSIGNAL")
```
The attendees are expected to have:
- Experience in R programming, and
- Familiarity with SingleCellExperiment and/or SpatialExperiment objects.
Below is a list of tasks we will go through during the workshop.
Task | Time |
---|---|
Concept behind clustSIGNAL | 10 min |
Method parameters | 5 min |
How to run clustSIGNAL | 10 min |
Assessing relevance of clusters | 10 min |
Exploring clustSIGNAL outputs | 10 min |
Summary and general discussion | 15 min |
- Understand how clustSIGNAL works, how it embeds spatial information into the data, and how that is used for clustering.
- Learn about the different clustering parameters available in clustSIGNAL.
- Perform clustSIGNAL clustering on spatial transcriptomics data.
- Visualise the clusters obtained from running clustSIGNAL and evaluate their relevance.
- Learn about the different outputs generated by clustSIGNAL and how they can be useful for data exploration.
clustSIGNAL (clustering of Spatially Informed Gene expression with Neighbourhood Adapted Learning) is a cell-type clustering method for high-resolution spatial transcriptomics data. It addresses data sparsity by applying adaptive smoothing to generate modified gene expression values that embed spatial context and neighbourhood composition information.
To capture the neighbourhood composition of each cell, we use entropy as a measure of the "domainness" of cell neighbourhoods: the more homogeneous a neighbourhood, the lower its entropy, and the more heterogeneous a neighbourhood, the higher its entropy. The entropy values are then used to generate neighbourhood-specific weights for adaptive smoothing of gene expression, so that smoothing is performed over more cells in homogeneous neighbourhoods and over a much smaller region in heterogeneous neighbourhoods.
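To make the entropy idea concrete, here is a minimal base R sketch that computes Shannon entropy for two toy neighbourhoods. The helper name `neighbourhood_entropy()` and the use of a base-2 logarithm are assumptions made for illustration; this is not the package's own implementation.

```r
# Shannon entropy (in bits) of the cluster labels found in one neighbourhood.
# Hypothetical helper for illustration only.
neighbourhood_entropy <- function(labels) {
  p <- table(labels) / length(labels)  # proportion of each observed label
  -sum(p * log2(p))
}

# A neighbourhood made of one cell type vs. four equally frequent types.
homogeneous   <- rep("A", 8)
heterogeneous <- c("A", "A", "B", "B", "C", "C", "D", "D")

neighbourhood_entropy(homogeneous)    # 0: fully homogeneous
neighbourhood_entropy(heterogeneous)  # 2: four equally likely labels
```

The homogeneous neighbourhood gets the lowest possible entropy, so it would be smoothed over more cells, while the mixed neighbourhood gets a higher value and would be smoothed over a smaller region.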
Figure: clustSIGNAL method overview.
The core steps involved in the method are sequential:
1. The method starts with non-spatial clustering and subclustering (Louvain clustering by default) to classify cells into subclusters that we refer to as "initial clusters".
Function: clustSIGNAL::p1_clustering()
2. The neighbourhood of each cell is defined in terms of its "initial clusters" composition.
Function: clustSIGNAL::neighbourDetect()
3. The cells in the neighbourhood are also sorted and rearranged so that the neighbours belonging to the same "initial clusters" group as the index cell are placed closer to it.
Function: clustSIGNAL::neighbourDetect()
4. Neighbourhood “domainness” is measured as entropy, where high entropy values indicate more heterogeneous neighbourhoods and low entropy values indicate more homogeneous neighbourhoods.
Function: clustSIGNAL::entropyMeasure()
5. The entropy values are used to generate weight distributions specific to each neighbourhood.
Function: clustSIGNAL::adaptiveSmoothing()
6. The gene expressions of cells are adaptively smoothed using the entropy-guided weight distributions; cells in heterogeneous neighbourhoods (high entropy) undergo smoothing over a smaller region, whereas cells in homogeneous neighbourhoods (low entropy) undergo smoothing over a larger region.
Function: clustSIGNAL::adaptiveSmoothing()
7. Non-spatial clustering is performed with adaptively smoothed gene expression to generate clustSIGNAL clusters that represent cell types.
Function: clustSIGNAL::p2_clustering()
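Steps 2 to 6 above can be sketched on toy data in base R. Everything in this block is an illustrative assumption: the simulated data, the helper names, the toy neighbourhood size, and in particular the weighting scheme (Gaussian densities evaluated at positions running from 0 to the entropy value). The `sd` argument here is not on the same scale as the package's `spread` parameter, and the package functions listed above may work differently.

```r
set.seed(1)

# Toy data: 60 cells in 2D, an "initial cluster" label, and one gene.
n <- 60
coords <- cbind(x = runif(n), y = runif(n))
init_cluster <- ifelse(coords[, "x"] < 0.5, "A",
                       sample(c("B", "C", "D"), n, replace = TRUE))
expr <- rnorm(n, mean = ifelse(init_cluster == "A", 5, 1))
NN <- 10  # toy neighbourhood size

# Steps 2-3: find the NN nearest cells, then move neighbours that share
# the index cell's initial cluster to the front (distance order is kept
# within each group).
get_neighbours <- function(i) {
  d <- sqrt(rowSums(sweep(coords, 2, coords[i, ])^2))
  nb <- setdiff(order(d), i)[seq_len(NN)]
  same <- init_cluster[nb] == init_cluster[i]
  c(i, nb[same], nb[!same])
}

# Step 4: entropy (bits) of the initial-cluster composition.
entropy_bits <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Step 5 (assumed scheme): Gaussian-shaped weights over positions from 0
# to the entropy value. Zero entropy gives uniform weights; higher
# entropy concentrates weight on the first cells in the sorted list.
kernel_weights <- function(entropy, size, sd = 1) {
  w <- dnorm(seq(0, entropy, length.out = size), sd = sd)
  w / sum(w)
}

# Step 6: weighted average of expression over the sorted neighbourhood.
smooth_cell <- function(i) {
  nb <- get_neighbours(i)
  h <- entropy_bits(init_cluster[nb])
  sum(kernel_weights(h, length(nb)) * expr[nb])
}

smoothed <- vapply(seq_len(n), smooth_cell, numeric(1))
summary(smoothed)
```

In this sketch, cells deep inside the homogeneous "A" region are averaged almost uniformly over their neighbours, while cells in the mixed region keep most of the weight on themselves and their closest same-cluster neighbours.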
The clustSIGNAL package takes a SpatialExperiment object as input. We provide a number of parameters to explore and experiment with, as well as previously tested default values for quick runs. clustSIGNAL can be used for single-sample or multi-sample analysis with just one function call. Below is the list of available parameters and their possible values:
- spe - SpatialExperiment object containing a matrix of cell spatial coordinates (stored in spatialCoords(spe)) and normalized gene expression counts (stored in the logcounts(spe) assay).
- samples - column name in cell metadata (stored in colData(spe)) containing sample names.
- cells - column name in cell metadata (stored in colData(spe)) containing cell IDs.
- dimRed - name of the low-dimensional embedding to use (one of reducedDimNames(spe)). Default value is "None", in which case PCA is calculated and used as the low-dimensional data.
- batch - whether batch correction should be performed. Default value is FALSE.
- batch_by - column name in cell metadata (stored in colData(spe)) containing the groups to use for batch correction.
- NN - neighbourhood size, given as the number of nearest neighbours to consider. Value should be > 1. Default value is 30.
- kernel - type of weight distribution to use. Can be Gaussian (default) or exponential.
- spread - value of the distribution parameter: standard deviation of the Gaussian distribution or rate of the exponential distribution. Default value is 0.05, which is recommended for the Gaussian distribution. For the exponential distribution, the recommended value is 20.
- sort - whether cell neighbourhoods should be sorted by their "initial clusters" grouping. Default value is TRUE.
- threads - number of cores to use for parallel runs. Default value is 1.
- outputs - choice of output types. Default value is "c" for a data frame of cell IDs and cluster numbers. The other possible value is "a" for a list containing the data frame of clusters plus the final SpatialExperiment object.
- clustParams - parameter options for the TwoStepParam clustering methods in the bluster package. The clustering parameters are given in this order: centers (centers) for clustering with KmeansParam, centers (centers) for sub-clustering clusters with KmeansParam, maximum iterations (iter.max) for clustering with KmeansParam, k values (k) for clustering with NNGraphParam, and community detection method (cluster.fun) to use with NNGraphParam.
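As a rough sketch of how these parameters fit together in a single call, the block below collects the defaults listed above and passes them to the package. It assumes the main entry point is `clustSIGNAL::clustSIGNAL()` and that the installed version accepts the argument names in this list; the column names `"sample_id"` and `"cell_id"` are hypothetical placeholders for your own metadata columns. The `kernel` argument is left at its default because the accepted argument strings are not given here.

```r
# Argument values are the defaults described in the parameter list.
# "sample_id" and "cell_id" are hypothetical colData(spe) column names.
run_args <- list(
  samples = "sample_id",  # column with sample names
  cells   = "cell_id",    # column with cell IDs
  dimRed  = "None",       # let the method compute PCA
  batch   = FALSE,        # no batch correction
  NN      = 30,           # neighbourhood size
  spread  = 0.05,         # recommended for the Gaussian kernel
  sort    = TRUE,         # sort neighbours by initial cluster
  threads = 1,
  outputs = "c"           # data frame of cell IDs and cluster numbers
)

# Sanity checks taken from the parameter descriptions.
stopifnot(run_args$NN > 1, run_args$outputs %in% c("c", "a"))

# Runs only if the package is installed and a prepared SpatialExperiment
# object named `spe` already exists in the session.
if (requireNamespace("clustSIGNAL", quietly = TRUE) && exists("spe")) {
  res <- do.call(clustSIGNAL::clustSIGNAL, c(list(spe = spe), run_args))
  print(head(res))
}
```

Keeping the arguments in a named list makes it easy to change one setting at a time (for example `NN` or `spread`) and compare the resulting clusters.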