Fix plotting na color bug; improve na color order (at bottom); Refine…

… cross species vignette
mvfki · Dec 14, 2023 · 222bb2c · 222bb2c
1 parent 4245bb2
commit 222bb2c
Show file tree

Hide file tree

Showing 3 changed files with 78 additions and 57 deletions.
diff --git a/R/ggplotting.R b/R/ggplotting.R
@@ -217,11 +217,13 @@ plotCellScatter <- function(
     dotOrder <- match.arg(dotOrder)
     set.seed(seed)
     raster <- .checkRaster(nrow(plotDF), raster)
-
     if (!is.null(colorBy)) {
         if (dotOrder == "shuffle") {
-            idx <- sample(nrow(plotDF))
-            plotDF <- plotDF[idx, ]
+            # Always put NA at bottom layer, i.e. plot them first
+            isNA <- which(is.na(plotDF[[colorBy]]))
+            nonNA <- which(!is.na(plotDF[[colorBy]]))
+            idx <- sample(nonNA)
+            plotDF <- plotDF[c(isNA, idx), ]
         } else if (dotOrder == "ascending") {
             plotDF <- plotDF[order(plotDF[[colorBy]], decreasing = FALSE,
                                    na.last = FALSE),]
@@ -749,7 +751,8 @@ plotCellViolin <- function(
                     .setColorLegendPalette(plot$data[[varName]],
                                            aesType = a,
                                            labels = colorLabels,
-                                           values = colorValues)
+                                           values = colorValues,
+                                           naColor = naColor)
             }
         } else {
             # continuous setting
@@ -810,7 +813,8 @@ plotCellViolin <- function(
         fct,
         aesType = c("colour", "fill"),
         labels = NULL,
-        values = NULL
+        values = NULL,
+        naColor = "#F5F5F5"
 ) {
     aesType <- match.arg(aesType)
     layer <- NULL
@@ -820,10 +824,12 @@ plotCellViolin <- function(
     if (!is.null(labels) && !is.null(values)) {
         if (aesType == "colour") {
             layer <- ggplot2::scale_color_manual(values = values,
-                                                 labels = labels)
+                                                 labels = labels,
+                                                 na.value = naColor)
         } else {
             layer <- ggplot2::scale_fill_manual(values = values,
-                                                labels = labels)
+                                                labels = labels,
+                                                na.value = naColor)
         }
     }
     return(layer)

diff --git a/man/liger-class.Rd b/man/liger-class.Rd
diff --git a/vignettes/articles/cross_species_vig.Rmd b/vignettes/articles/cross_species_vig.Rmd
@@ -5,110 +5,125 @@ date: "12/6/2021"
 output: html_document
 ---
 
-```{r eval = FALSE}
-knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, results = "hide")
+```{r setup, echo=FALSE}
+knitr::opts_chunk$set(warning = FALSE, message = FALSE)
 ```
 
-```{r setup, include=FALSE}
-library(rliger2)
-mouse <- readRDS("../../cross_species_vignette/Drop_mouse.RDS")
-lizard <- readRDS("../../cross_species_vignette/RNA_lizard.RDS")
-lizard_annies <- readRDS("../../cross_species_vignette/lizard_labels.RDS")
-mouse_annies <- readRDS("../../cross_species_vignette/Dropviz_general_annotations.RDS")
-
-```
 
 In this vignette, we demonstrate how to integrate dataset from different species. 
 
-## Step 1: Download the data
+## Step 1: Load the data
 
-For this tutorial, we will use two matrices, which can all be downloaded at https://www.dropbox.com/sh/y9kjoum8u469nj1/AADik2b2-Qo3os2QSWXdIAbna?dl=0 .
+### 1. Load data and create liger object
+
+For this tutorial, we will use two datasets, which can all be downloaded at https://www.dropbox.com/sh/y9kjoum8u469nj1/AADik2b2-Qo3os2QSWXdIAbna?dl=0 .
 
 - The scRNA mouse dataset (Drop_mouse.RDS) is 28,366 genes by 71,639 cells.
 - The scRNA lizard dataset (RNA_lizard.RDS) is 15,345 genes by 4,202 cells. 
 
-```{r eval = FALSE}
+The input raw count matrices are presented in dgCMatrix class objects, which is a common form of sparse matrix. Users can then create a `liger` object with the two datasets. A named list is required for submitting the datasets to the object creator. 
+
+```{r readRDS}
 library(rliger2)
-mouse <- readRDS("Drop_mouse.RDS")
-lizard <- readRDS("RNA_lizard.RDS")
+options(ligerDotSize = 0.5)
+
+mouse <- readRDS("cross_species_vignette_data/Drop_mouse.RDS")
+lizard <- readRDS("cross_species_vignette_data/RNA_lizard.RDS")
+
+lig <- createLiger(list(mouse = mouse, lizard = lizard))
 ```
 
 ## Step 2: Preprocessing and normalization
 
-First, we create a `liger` object as the container of the analysis. A named list is required for submitting the datasets to the object creator. In a liger object, the raw matrices will not be merged but stored in separated dataset-specific containers.
+### 2. Normalization
 
-```{r createliger}
-lig <- createLiger(list(mouse = mouse, lizard = lizard))
-```
-Normalize the datasets. Liger simply normalizes the matrices by library size, without multiplying a scale factor or applying log1p transformation. 
+Liger simply normalizes the matrices by library size, without multiplying a scale factor or applying "log1p" transformation. 
 
 ```{r normalize}
 lig <- normalize(lig)
 ```
 
-Select shared, homologous genes between the two species, as well as unshared, non-homologous genes from the lizard dataset. The default setting of `selectGenes()` function selects for homologous genes shared between all datasets. To enable selection of unshared genes, users need to turn `unshared = TRUE` and specify the `unshared.datasets`.
+### 3. Select variable genes
+
+For cross-species analysis, we select shared, homologous genes between the two species, as well as unshared, non-homologous genes from the lizard dataset. The default setting of `selectGenes()` function selects for homologous genes shared between all datasets. To enable selection of unshared genes, users need to specify the name(s) of dataset(s) where unshared genes should be chosen from to `useUnsharedDatasets`. 
 
-```{r selectvargenes}
-lig <- selectGenes(lig, var.thres = 0.3, unshared = TRUE, unshared.datasets = "lizard", unshared.thresh = 0.3)
+```{r selectvargenes, message=TRUE}
+lig <- selectGenes(lig, var.thres = 0.3, useUnsharedDatasets = "lizard", unsharedThresh = 0.3)
 ```
 
+### 4. Scale not center
+
 Then we scale the dataset. Three new matrices will be created under the hook, two containing the shared variable features for both datasets, and one for the unshared variable features of lizard data. Note the Liger does not center the scaled data because iNMF/UINMF methods require non-negative input. 
 
 ```{r scale}
 lig <- scaleNotCenter(lig)
 ```
 
-
 ## Step 3: Joint Matrix Factorization
 
-`optimizeALS()` function allows users to factorize the datasets. Setting `useUnshared = TRUE` enables running UINMF method with the scaled unshared feature matrix.
+### 5. Mosaic integrative NMF
+
+Unshared Integrative Non-negative Matrix Factorization (UINMF) can be applied with `runIntegration(..., method = "UINMF")`. A standalone function `runUINMF()` is also provided with more detailed documentation and initialization setup. This step produces factor gene loading matrices of the shared genes: $W$ for shared information across datasets, and $V$ for dataset specific information. Specific to UINMF method, additional factor gene loading matrix of unshared genes, $U$ is also produced. $H$ matrices, the cell factor loading matrices are produced for each dataset and can be interpreted as load rank representation of the cells.
 
 ```{r factorization}
-lig <- optimizeALS(lig, k = 30, useUnshared = TRUE, thresh = 1e-10)
+lig <- runIntegration(lig, k = 30, method = "UINMF")
 ```
 
 ## Step 4: Quantile Normalization and Joint Clustering
 
-After factorization, the resulting liger object can used in all downstream LIGER functions without adjustment. The default reference dataset for quantile normalization is the larger dataset, but the user should select the higher quality dataset as the reference dataset, even if it is the smaller dataset. In this case, the mouse dataset is considered higher quality than the lizard dataset, so we set the mouse dataset to be the reference dataset.
+### 6. Quantile normalization
+
+ The default reference dataset for quantile normalization is the largest dataset, but the user should select the higher quality dataset as the reference dataset, even if it is a smaller dataset. In this case, the mouse dataset is considered higher quality than the lizard dataset, so we set the mouse dataset to be the reference dataset.
+After this step, the low-rank cell factor loading matrices, $H$, are aligned and ready for cluster definition. 
 
 ```{r quantilenorm}
 lig <- quantileNorm(lig, reference = "mouse")
-lig <- runLeidenCluster(lig, nNeighbors = 30, resolution = 0.6)
+```
+
+### 7. Leiden clustering
+
+With the aligned cell factor loading information, we next apply Leiden community detection algorithm to identify cell clusters.
+
+```{r leiden}
+lig <- runCluster(lig, nNeighbors = 30, resolution = 0.6)
 ```
 
 ## Step 5: Visualization
 
+### 8. Dimensionality reduction
+
+We create a UMAP with the quantile normalized cell factor loading.
+
 ```{r runumap}
-lig <- runUMAP(lig, nNeighbors = 30, distance = "cosine", minDist = 0.3)
+lig <- runUMAP(lig, nNeighbors = 30, minDist = 0.3)
 ```
 
-Next, we can visualize our returned factorized object by dataset to check the alignment between datasets. 
+### 9. Plot UMAP
 
-```{r visualizations}
-options(ligerDotSize = 0.3)
-plotByDatasetAndCluster(lig, combinePlots = FALSE)
+Next, we can visualize our returned factorized object by dataset to check the integration between datasets, and also the cluster labeling on the global population.
+
+```{r visualizations, fig.width=7, fig.height=6}
+plotDatasetDimRed(lig)
+plotClusterDimRed(lig, legendNCol = 2)
 ```
 
 We can also use the datasets' original annotations to check the correspondence between the cell types of the two species. The annotations can be downloaded at https://www.dropbox.com/sh/y9kjoum8u469nj1/AADik2b2-Qo3os2QSWXdIAbna?dl=0
 
-```{r annies, eval = FALSE}
-mouse_annies = readRDS("Dropviz_general_annotations.RDS")
-lizard_annies = readRDS("lizard_labels.RDS")
+```{r annies}
+mouse_annies = readRDS("cross_species_vignette_data/Dropviz_general_annotations.RDS")
+lizard_annies = readRDS("cross_species_vignette_data/lizard_labels.RDS")
 ```
 
-Insert the original annotation into the liger object by simply using the `$` syntax, and then it can be used in any downstream operation where a cluster variable is required. 
+`cellMeta()<-` method has the feature for partial insertion implemented and is great for adding dataset specific metadata to a new variable. If you are unsure about the order between given variable and the cells in the object, argument `cellIdx` can be used for precise locating. Users might use different strategies when creating the index depending on different situation.
+
+```{r addAnn, fig.width=7, fig.height=6}
+mouse_id <- paste0("mouse_", names(mouse_annies))
+cellMeta(lig, "mouse_ann", cellIdx = mouse_id) <- factor(mouse_annies)
 
-```{r addAnn}
-mouse_annies <- mouse_annies[colnames(mouse)]
-lizard_annies <- lizard_annies[colnames(lizard)]
-lig$mouse_ann <- NA
-lig$mouse_ann[lig$dataset == "mouse"] <- mouse_annies
-lig$mouse_ann <- factor(lig$mouse_ann)
-lig$lizard_ann <- NA
-lig$lizard_ann[lig$dataset == "lizard"] <- lizard_annies
-lig$lizard_ann <- factor(lig$lizard_ann)
+lizard_id <- paste0("lizard_", names(lizard_annies))
+cellMeta(lig, "lizard_ann", cellIdx = lizard_id) <- factor(lizard_annies)
 
-plotClusterDimRed(lig, useCluster = "mouse_ann")
-plotClusterDimRed(lig, useCluster = "lizard_ann")
+plotClusterDimRed(lig, useCluster = "mouse_ann", legendNCol = 1)
+plotClusterDimRed(lig, useCluster = "lizard_ann", legendNCol = 1)
 ```