procedure1_cluster_replicability.Rmd

---
title: "Scaling up reproducible research for single cell transcriptomics using MetaNeighbor (Procedure 1)"
output:
  pdf_document:
    fig_caption: yes
graphics: yes
---

```{r setup, include = FALSE}
knit_print.data.frame = function(x, ...) {
  res = paste(c("", "", knitr::kable(x, digits=2, row.names=FALSE)), collapse = "\n")
  knitr::asis_output(res)
}

registerS3method(
  "knit_print", "data.frame", knit_print.data.frame,
  envir = asNamespace("knitr")
)
```

# Procedure 1: assessment of cell type replicability with unsupervised MetaNeighbor

Procedure 1 demonstrates how to compute and visualize cell type replicability across 4 human pancreas datasets. It shows steps detailing how to download and reformat the datasets with the SingleCellExperiment package, how to compute and interpret MetaNeighbor AUROCs. 

## Creation of a merged SingleCellExperiment dataset (1-2 minutes)

1. We consider 4 pancreatic datasets along with their independent annotation (from the original publications). MetaNeighbor expects a gene-by-cell matrix encapsulated in a SummarizedExperiment format. We recommend the SingleCellExperiment (SCE) package, because it is able to handle sparse matrix formats. Load the pancreas datasets using the scRNAseq package, which provide annotated datasets that are already in the SingleCellExperiment format:

```{r load_pancreas, message=FALSE}
library(scRNAseq)
my_data <- list(
    baron = BaronPancreasData(),
    lawlor = LawlorPancreasData(),
    seger = SegerstolpePancreasData(),
    muraro = MuraroPancreasData()
)
```

Note that Seurat objects can easily be converted into SingleCellExperiment objects by using the `as.SingleCellExperiment` function for Seurat v3 objects and `Convert(from = seurat_object, to = "sce")` for Seurat v2 objects.

2. MetaNeighbor's `mergeSCE` function can be used to merge multiple SingleCellExperiment objects. Importantly, the output object will be restricted to genes, metadata columns and assays that are common to all datasets. Before using `mergeSCE`, make sure that gene and metadata information aligns across datasets.

Start by checking if gene information aligns (stored in the `rownames` slot of the SCE object):

```{r check_genes}
lapply(my_data, function(x) head(rownames(x), 3))
```
Two datasets (Baron, Segerstolpe) use gene symbols, one dataset (Muraro) combines symbols with chromosome information (to avoid duplicate gene names) and the last dataset (Lawlor) uses Ensembl identifiers. Here we will convert all gene names to unique gene symbols. Start by converting gene names in the Muraro dataset by using the symbols stored in the `rowData` slot of the SCE object, and remove all duplicated gene symbols:

```{r convert_genes_muraro}
rownames(my_data$muraro) <- rowData(my_data$muraro)$symbol
my_data$muraro <- my_data$muraro[!duplicated(rownames(my_data$muraro)),]
```
Next, convert Ensembl IDs to gene symbols in the Lawlor dataset, removing all IDs with no match and all duplicated symbols:

```{r convert_genes_lawlor, message=FALSE}
library(org.Hs.eg.db)

symbols <- mapIds(org.Hs.eg.db, keys=rownames(my_data$lawlor),
                  keytype="ENSEMBL", column="SYMBOL")
keep <- !is.na(symbols) & !duplicated(symbols)
my_data$lawlor <- my_data$lawlor[keep,]
rownames(my_data$lawlor) <- symbols[keep]
```

3. We now turn our attention to metadata, which are stored in the `colData` slot of the SCE objects. Here, make sure that the column that contains cell type information is labeled identically in all datasets:

```{r check_metadata}
lapply(my_data, function(x) colnames(colData(x)))
```

Two datasets have the cell type information in the "cell type" column, the other two in the "label" column. Add a "cell type" column in the latter two datasets.

```{r convert_metadata}
my_data$baron$"cell type" <- my_data$baron$label
my_data$muraro$"cell type" <- my_data$muraro$label
```

4. Last, check that count matrices, stored in the `assays` slot, have identical names.

```{r check_assays}
lapply(my_data, function(x) names(assays(x)))
```

The count matrices are all stored in an assay named "counts", no change is needed here.

5. Now that gene, cell type and count matrix information is aligned across datasets, create a merged dataset using `mergeSCE`, which takes a list of SCE objects as an input and outputs a single SCE object:

```{r merge_pancreas}
library(MetaNeighbor)

fused_data = mergeSCE(my_data)
dim(fused_data)
head(colData(fused_data))
```

The new dataset contains 15,295 common genes, 15,793 cells and two metadata columns: a concatenated "cell type" column, and "study_id", a column created by `mergeSCE` containing the name of the original studies (corresponding to the names provided in the "my_data" list).

6. To obtain a cursory overview of cell type composition by study, cross-tabulate cell type annotations by study IDs:

```{r cross_tabulate_pancreas}
table(fused_data$"cell type", fused_data$study_id)
```

Most cell types are present in all datasets, so we expect MetaNeighbor to find multiple high confidence matches across datasets. There are slight typographic differences in cell type annotations (ductal/Ductal), but we recommend keeping the author annotations at this stage. The only procedure that requires identical annotations across datasets is Procedure 3, where we perform functional characterization of replicating cell types.

7. To avoid having to recreate the merged object, save it as an RDS file:

```{r save_pancreas}
saveRDS(fused_data, "merged_pancreas.rds")
```

The remaining sections of the procedure can be run at a later time in a new R session.

## Hierarchical cell type replicability analysis (1 minute)

8. Start by loading the MetaNeighbor (analysis) and the SingleCellExperiment (data handling) libraries, as well as the previously created pancreas dataset:

```{r load_merged_pancreas}
library(MetaNeighbor)
library(SingleCellExperiment)

pancreas_data = readRDS("merged_pancreas.rds")
```

9. To perform neighbor voting and identify replicating cell types, MetaNeighbor builds a cell-cell similarity network, which we defined as the Spearman correlation over a user-defined set of genes. We found that we obtained best results by picking genes that are highly variable across datasets, which can be picked using the `variableGenes` function. Select highly variable genes for the pancreas datasets:

```{r compute_global_hvgs}
global_hvgs = variableGenes(dat = pancreas_data, exp_labels = pancreas_data$study_id)
length(global_hvgs)
```

The function returns a list of 600 genes that were detected as highly variable in each of the 4 datasets. In our experience, we obtained best and robust performance for gene sets ranging from 200 to 1,000 variable genes. In general, using a larger number of datasets selects robustly varying genes, enabling high performance with a smaller number of genes. However, if `variableGenes` returns a gene set that is too small (in particular when you are comparing a large number of datasets), the number of genes can be increased by setting the "min_recurrence" parameter. For example, by setting "min_recurrence=2", we keep all genes that are highly variable in at least 2 ot of the 4 datasets. Additionally, genes are sorted by relevance in the latest version of MetaNeighbor (available on Github and Bioconductor version 3.13), so it is always possible to select a smaller number of genes. For example, global_hvgs[1:500] selects the top 500 highly variable genes that are recurrent across all 4 datasets.

10. The merged dataset and a set of biological meaningful genes is that is needed to run MetaNeighbor and obtain cell type similarities. Because the dataset is large (> 10k cells), run the fast implementation of MetaNeighbor ("fast_version=TRUE"):

```{r compute_global_aurocs}
aurocs = MetaNeighborUS(var_genes = global_hvgs,
                        dat = pancreas_data,
                        study_id = pancreas_data$study_id,
                        cell_type = pancreas_data$"cell type",
                        fast_version = TRUE)
```

`MetaNeighborUS` returns a cell-type-by-cell-type matrix containing cell type similarities. Cell type similarities are defined as an Area Under the ROC curve (AUROC), which range between 0 and 1, where 0 indicates low similarity and 1 high similarity. Note that the "fast_version=TRUE" parameter uses a slightly simplified version of MetaNeighbor that is significantly faster and more memory efficient. It should always be used on large datasets (> 10k cells), but can also be run on smaller datasets and yields equivalent results to the original MetaNeighbor algorithm.

11. For ease of interpretation, visualize AUROCs as a heatmap, where rows and columns are cell types from all the datasets:

```{r plot_global_aurocs, fig.width=7, fig.height=6.5}
plotHeatmap(aurocs, cex = 0.5)
```

In the heatmap, the color of each square indicates the proximity of a pair of cell types, ranging from blue (low similarity) to red (high similarity). For example, "baron|gamma" (2nd row) is highly similar to "seger|gamma" (3rd column from the right) but very dissimilar from "muraro|duct" (middle column). To group similar cell types together, `plotHeatmap` applies hierarchical clustering on the AUROC matrix. On the heatmap, we see two large red blocks that indicate hierarchical structure in the data, with endocrine cell types clustering together (e.g., alpha, beta, gamma) and non-endocrine cells on the other side (e.g., amacrine, ductal, endothelial). Note that each red block is composed of smaller red blocks, indicating that cell types can be matched at an even higher resolution. The presence of off-diagonal patterns (e.g., "lawlor|Gamma/PP", "lawlor|Delta") suggests the presence of doublets or contamination, but the heatmap is dominated by the clear presence of red blocks, which is a strong indicator of replicability.

In the latest version of MetaNeighbor (available on Github and Bioconductor 3.13), we increased the flexibility of heatmaps. `plotHeatmap` internally relies on `gplots::heatmap.2`: you can pass any valid `heatmap.2` parameter to `plotHeatmap`, for example the “ColSideColors” parameter can be used to annotate the columns of the heatmap (one color by dataset). Alternatively, the `MetaNeighbor::ggPlotHeatmap` function returns a customizable ggplot2 object.

12. To identify pairs of replicable cell types, run the following function:

```{r show_global_top_hits}
topHits(aurocs, dat = pancreas_data, study_id = pancreas_data$study_id,
        cell_type = pancreas_data$"cell type", threshold = 0.9)
```

`topHits` relies on a simple heuristic: a pair of cell type is replicable if they are reciprocal top hits (they preferentially vote for each other) and the AUROC exceeds a given threshold value (in our experience, 0.9 is a good heuristic value). We find a long list of replicable endocrine cell types (e.g., epsilon, alpha and beta cells) and non-endocrine cell types (e.g. mast, endothelial or acinar cells). This list provides strong evidence that these cell types are robust, as they are identified across all datasets with high AUROC.

13. In the case where there is a clear structure in the data (endocrine vs non-endocrine here), we can refine AUROCs by splitting the data. AUROCs have a simple interpretation: an AUROC of 0.6 indicates that cells from a given cell type are ranked in front of 60% of other target cells. However, this interpretation is outgroup dependent: because endocrine cells represent ~65% of cells, even an unrelated pair of non-endocrine cell types will have an AUROC > 0.65, because non-endocrine cells will always be ranked in front of endocrine cells.

By starting with the full datasets, we uncovered the global structure in the data (endocrine vs non-endocrine). However, to evaluate replicability of endocrine cell types and reduce dataset composition effects, we can make the assessment more stringent by restricting the outgroup to close cell types, i.e. by keeping only endocrine subtypes. Split cell types in two by using the `splitClusters` function and retain only endocrine cell types:

```{r split_global_clusters}
level1_split = splitClusters(aurocs, k = 2)
level1_split
first_split = level1_split[[2]]
```
By outputting "level1_split", we found that the cell types were nicely split between non-endocrine and endocrine, and that endocrine cell types where in the second element of the list. Note that `splitClusters` applies a simple hierarchical clustering algorithm to separate cell types, cell types can be selected manually in more complex scenarios.

14. We repeat the MetaNeighbor analysis on endocrine cells only. First, subset the data to the endocrine cell types that were previously stored in "first_split":

```{r subset_global_data}
full_labels = makeClusterName(pancreas_data$study_id, pancreas_data$"cell type")
subdata = pancreas_data[, full_labels %in% first_split]
dim(subdata)
```

The new dataset contains the 9341 putative endocrine cells.

15. To focus on variability that is specific to endocrine cells, re-pick highly variable genes:

```{r compute_endocrine_hvgs}
var_genes = variableGenes(dat = subdata, exp_labels = subdata$study_id)
```

16. Finally, recompute cell type similarities and visualize AUROCs:

```{r compute_endocrine_aurocs, fig.width=7, fig.height=6.5}
aurocs = MetaNeighborUS(var_genes = var_genes,
                        dat = subdata, fast_version = TRUE,
                        study_id = subdata$study_id,
                        cell_type = subdata$"cell type")
plotHeatmap(aurocs, cex = 0.7)
```

The resulting heatmap illustrates an example of a strong set of replicating cell types: when the assessment becomes more stringent (restriction to closely related cell types), the similarity of replicating cell types remains strong (AUROC~1 for alpha, beta, gamma, delta and epsilon cells) while the cross-cell-type similarity decreases (shift from red to blue, e.g. similarity of alpha and beta cell types has shifted from orange/red in the global heatmap to dark blue in the endocrine heatmap) by virtue of zooming in on a subpart of the dataset.

17. We can continue to zoom in as long as there are at least two cell types per dataset. Repeat the previous steps to split the endocrine cell types:

```{r split_endocrine, fig.width=7, fig.height=6.5}
level2_split = splitClusters(aurocs, k = 3)
my_split = level2_split[[3]]
subdata = pancreas_data[, full_labels %in% my_split]
var_genes = variableGenes(dat = subdata, exp_labels = subdata$study_id)
length(var_genes)
aurocs = MetaNeighborUS(var_genes = var_genes,
                        dat = subdata, fast_version = TRUE,
                        study_id = subdata$study_id,
                        cell_type = subdata$"cell type")
plotHeatmap(aurocs, cex = 1)
```

Here we remove the alpha and beta cells (representing close to 85% of endocrine cells) and validate that, even when restricting to neighboring cell types, there is still a clear distinction between delta, gamma and epsilon cells (AUROC ~ 1).


## Stringent assessment of replicability with one-vs-best AUROCs (1 minute)

In the previous section, we created progressively more stringent replicability assessments by selecting more and more specific subsets of related cell types. As an alternative, we provide the "one_vs_best" parameter, which offers similar results without having to restrict the dataset manually. In this scoring mode, MetaNeighbor will automatically identify the two closest matching cell types in each target dataset and compute an AUROC based on the voting result for cells from the closest match against cells from the second closest match. Essentially, we are asking how easily a cell type can be distinguished from its closest neighbor.

18. To obtain one-vs-best AUROCs, run the same command as before with two additional parameters: "one_vs_best = TRUE" and "symmetric_output = FALSE":

```{r compute_1v1_aurocs, fig.width=7, fig.height=6.5}
best_hits = MetaNeighborUS(var_genes = global_hvgs,
                           dat = pancreas_data,
                           study_id = pancreas_data$study_id,
                           cell_type = pancreas_data$"cell type",
                           fast_version = TRUE,
                           one_vs_best = TRUE, symmetric_output = FALSE)
plotHeatmap(best_hits, cex = 0.5)
```

The interpretation of the heatmap is slightly different compared to one-vs-all AUROCs. First, since we only compare the two closest cell types, most cell type combinations are not tested (NAs, shown in gray on the heatmap). Second, by setting "symmetric_output=FALSE", we broke the symmetry of the heatmap: reference cell types are shown as columns and target cell types are shown as rows. Since each cell type is only tested against two cell types in each target dataset (closest and second closest match), we have 8 values per column (2 per dataset).

This representation helps to rapidly identify a cell type's closest hits as well as their closest outgroup. For example, ductal cells (2nd red square from the top right) strongly match with each other (one-vs-best AUROC>0.8) and acinar cells are their closest outgroup (blue segments in the same column). The non-symmetric view makes it clear when best hits are not reciprocal. For example, mast cells (first two columns) heavily vote for "lawlor|Stellate" and "muraro|mesenchymal", but this vote is not reciprocal. This pattern indicates that the mast cell type is missing in the Lawlor and Muraro datasets: because mast cells have no natural match in these datasets, they vote for the next closest cell type (stellate cells). The lack of reciprocity in voting is an important tool to detect imbalances in dataset composition.

19. When using one-vs-best AUROCs, we recommend extracting replicating cell types as meta-clusters. Cell types are part of the same meta-cluster if they are reciprocal best hits. Note that if cell type A is the reciprocal best hit of B and C, all three cell types are part of the same meta-cluster, even if B and C are not reciprocal best hits. To further filter for strongly replicating cell types, we specify an AUROC threshold (in our experience, 0.7 is a strong one-vs-best AUROC threshold). To extract meta-clusters and summarize the strength of each meta-cluster, run the following functions:

```{r extract_metaclusters}
mclusters = extractMetaClusters(best_hits, threshold = 0.7)
mcsummary = scoreMetaClusters(mclusters, best_hits)
```

The `scoreMetaClusters` provides a good summary of meta-clusters, ordering cell types by the number of datasets in which they replicate, then by average AUROC. We find 12 cell types that have strong support across at least 2 datasets, with 7 cell types replicating across all 4 datasets. 8 cell types are tagged as "outlier", indicating they had no strong match in any other dataset. These cell types usually contain doublets, low quality cells or contaminated cell types. To rapidly visualize the number of robust cell types, the replicability structure can be summarized as an Upset plot with the `plotUpset` function.

```{r plot_upset, fig.width=7, fig.height=6.5}
plotUpset(mclusters)
```
To further investigate the robustness of meta-clusters, they can be visualized as heatmaps (called "cell-type badges") with the `plotMetaClusters` function. Because the function generates one heatmap per meta-cluster, save the output to a PDF file to facilitate investigation:

```{r plot_meta_clusters}
pdf("meta_clusters.pdf")
plotMetaClusters(mclusters, best_hits)
dev.off()
```

Each badge shows an AUROC heatmap restricted to one specific meta-cluster. These badges help diagnose cases where AUROCs are lower in a specific reference or target dataset. For example, the "muraro|duct" cell type has systematically lower AUROCs, suggesting the presence of contaminating cells in another cell type (probably in the "muraro|unclear" cell type).

20. The last visualization is an alternative representation of the AUROC heatmap as a graph, which is particularly useful for large datasets. In this graph, top votes (AUROC > 0.5) are shown in gray, while outgroup votes (AUROC < 0.5) are shown in orange. To highlight close calls, we recommend keeping only strong outgroup votes, here with AUROC >= 0.3. To build and plot the cluster graph, run the following functions:

```{r plot_cluster_graph, fig.width=7, fig.height=6.5}
cluster_graph = makeClusterGraph(best_hits, low_threshold = 0.3)
plotClusterGraph(cluster_graph, pancreas_data$study_id,
                 pancreas_data$"cell type", size_factor=3)
```

We note that there are several orange edges, indicating that some cell types had two close matches. To investigate the origin of these close calls, we can focus on a cluster of interest (coi). Take a closer look at "baron|epsilon", query its closest neighbors in the graph with `extendClusterSet`, then zoom in on its subgraph with `subsetClusterGraph`:

```{r plot_subgraph, fig.width=7, fig.height=6.5}
coi = "baron|epsilon"
coi = extendClusterSet(cluster_graph, initial_set=coi, max_neighbor_distance=2)
subgraph = subsetClusterGraph(cluster_graph, coi)
plotClusterGraph(subgraph, pancreas_data$study_id,
                 pancreas_data$"cell type", size_factor=5)
```

In the "baron|epsilon" case, we find that the epsilon cell type is missing in the Lawlor dataset, so there is no natural match for the Baron epsilon cell type. In such cases, votes are frequently non-reciprocal and equally split between two unrelated cell types, here "Lawlor|Gamma/PP" and "Lawlor|Alpha". In general, the cluster graph can be used to understand how meta-clusters are extracted, why some clusters are tagged as outliers and diagnose problems where the resolution of cell types differs across datasets.