updated README and tweaked viz tools

jr-leary7 · Sep 20, 2023 · 7567015 · 7567015
1 parent c870ab2
commit 7567015
Show file tree

Hide file tree

Showing 13 changed files with 261 additions and 196 deletions.
diff --git a/NAMESPACE b/NAMESPACE
@@ -17,6 +17,7 @@ export(sortGenesHeatmap)
 export(stripGLM)
 export(testDynamic)
 export(testSlope)
+export(theme_scLANE)
 export(waldTestGEE)
 import(RcppEigen)
 import(glm2)
@@ -62,6 +63,7 @@ importFrom(gamlss,pb)
 importFrom(gamlss,random)
 importFrom(geeM,geem)
 importFrom(ggplot2,aes)
+importFrom(ggplot2,element_rect)
 importFrom(ggplot2,element_text)
 importFrom(ggplot2,facet_wrap)
 importFrom(ggplot2,geom_line)

diff --git a/NEWS.md b/NEWS.md
@@ -2,6 +2,7 @@
 
 * Added a function named `sortGenesHeatmap()` that aids in the creation of expression cascade heatmaps by sorting genes according to where in pseudotime their peak expression is. 
 * Changed the parameter `approx.knot` in the `testDynamic()` function to use (stochasticity-controlled) subsampling instead of `seq()` to reduce candidate knot space. 
+* Added `summarizeModels()` to sum up slopes across pseudotime intervals
 
 # scLANE 0.7.1
 

diff --git a/R/plotModels.R b/R/plotModels.R
@@ -29,7 +29,7 @@
 #' @param plot.gam (Optional) Should the fitted values from an NB GAM be plotted? Defaults to TRUE.
 #' @param plot.scLANE (Optional) Should the fitted values from the \code{scLANE} model be plotted? Defaults to TRUE.
 #' @param filter.lineage (Optional) A character vector of lineages to filter out before generating the final plot. Should be letters, i.e. lineage "A" or "B". Defaults to NULL.
-#' @param gg.theme (Optional) A \code{ggplot2} theme to be added to the plot. Defaults to \code{theme_classic(base_size = 14)}.
+#' @param gg.theme (Optional) A \code{ggplot2} theme to be added to the plot. Defaults to \code{\link{theme_scLANE}}.
 #' @return A \code{ggplot} object.
 #' @export
 #' @examples
@@ -70,9 +70,7 @@ plotModels <- function(test.dyn.res = NULL,
                        plot.gam = TRUE,
                        plot.scLANE = TRUE,
                        filter.lineage = NULL,
-                       gg.theme = ggplot2::theme_classic(base_size = 14,
-                                                         base_line_size = 0.75,
-                                                         base_rect_size = 0.75)) {
+                       gg.theme = theme_scLANE()) {
   # check inputs
   if (is.null(expr.mat) || is.null(pt) || is.null(gene) || is.null(test.dyn.res)) { stop("You forgot one or more of the arguments to plotModels().") }
   # get raw counts from SingleCellExperiment or Seurat object & transpose to cell x gene dense matrix

diff --git a/R/theme_scLANE.R b/R/theme_scLANE.R
@@ -0,0 +1,30 @@
+#' A \code{ggplot2} theme for \code{scLANE}.
+#'
+#' @name theme_scLANE
+#' @author Jack Leary
+#' @importFrom ggplot2 theme_classic theme element_rect
+#' @description A publication-ready theme for creating gene dynamics plots, embedding plots, etc.
+#' @param base.size The base font size. Defaults to 12.
+#' @param base.lwd The base linewidth. Defaults to 0.75.
+#' @param base.family The font family to be used throughout. Defaults to "sans".
+#' @return A \code{ggplot2} theme.
+#' @export
+#' @examples
+#' \dontrun{
+#' plotModels(gene_stats,
+#'            gene = "CD14",
+#'            pt = pt_df,
+#'            expr.mat = count_mat
+#' ) + theme_scLANE()
+#' }
+
+theme_scLANE <- function(base.size = 12,
+                         base.lwd = 0.75,
+                         base.family = "sans") {
+  ggplot2::theme_classic(base_size = base.size,
+                         base_family = base.family,
+                         base_line_size = base.lwd,
+                         base_rect_size = base.lwd) +
+    ggplot2::theme(strip.clip = "off",
+                   strip.background = ggplot2::element_rect(linewidth = base.lwd))
+}
diff --git a/README.Rmd b/README.Rmd
@@ -86,11 +86,18 @@ sim_data <- simulate_multi_subject(ref.dataset = panc,
                                    gene.dyn.threshold = 2)
 ```
 
-The PCA embeddings show us a pretty simple trajectory that's strongly correlated with the first principal component, and we can also see that the trajectory is conserved cleanly across subjects. 
+The PCA embeddings show us a pretty simple trajectory that's strongly correlated with the first principal component. 
 
-```{r plot-sims, results='hold'}
-plotPCA(sim_data, colour_by = "subject")
-plotPCA(sim_data, colour_by = "cell_time_normed")
+```{r plot-sims-pt, results='hold'}
+plotPCA(sim_data, colour_by = "cell_time_normed") + 
+  theme_scLANE()
+```
+
+We also see that the data are not clustered by subject, which indicates that gene dynamics are mostly homogeneous across subjects. 
+
+```{r plot-sims-subj, results='hold'}
+plotPCA(sim_data, colour_by = "subject") + 
+  theme_scLANE()
 ```
 
 ## Trajectory DE Testing
@@ -110,19 +117,19 @@ cell_offset <- createCellOffset(sim_data)
 Running `testDynamic()` provides us with a nested list containing model output & DE test results for each gene over each pseudotime / latent time lineage. In this case, since we have a true cell ordering we only have one lineage. Parallel processing is turned on by default, and we use 2 cores here to speed up runtime. 
 
 ```{r glm-models, results='hold'}
-de_test_glm <- testDynamic(expr.mat = sim_data, 
-                           pt = order_df, 
-                           genes = gene_sample, 
-                           size.factor.offset = cell_offset, 
-                           n.cores = 2, 
-                           track.time = TRUE)
+scLANE_models_glm <- testDynamic(sim_data, 
+                                 pt = order_df, 
+                                 genes = gene_sample, 
+                                 size.factor.offset = cell_offset, 
+                                 n.cores = 2, 
+                                 track.time = TRUE)
 ```
 
 After the function finishes running, we use `getResultsDE()` to generate a sorted table of DE test results, with one row for each gene & lineage. The GLM backend uses a simple likelihood ratio test to compare the null & alternate models, with the test statistic assumed to be [asymptotically Chi-squared distributed](https://en.wikipedia.org/wiki/Likelihood-ratio_test).  
 
 ```{r glm-results}
-de_res_glm <- getResultsDE(de_test_glm)
-select(de_res_glm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
+scLANE_res_glm <- getResultsDE(scLANE_models_glm)
+select(scLANE_res_glm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
   slice_head(n = 5) %>% 
   knitr::kable(format = "pipe", 
                digits = 3, 
@@ -134,11 +141,11 @@ After creating a reference table of the ground truth status of each gene - `1` d
 ```{r eval-glm}
 gene_status_df <- data.frame(gene = gene_sample, 
                              True_Gene_Status = ifelse(rowData(sim_data)[gene_sample, ]$geneStatus_overall == "Dynamic", 1, 0))
-de_res_glm <- inner_join(de_res_glm, 
-                         gene_status_df, 
-                         by = c("Gene" = "gene"))
-caret::confusionMatrix(factor(de_res_glm$Gene_Dynamic_Overall, levels = c(0, 1)), 
-                       factor(de_res_glm$True_Gene_Status, levels = c(0, 1)), 
+scLANE_res_glm <- inner_join(scLANE_res_glm, 
+                             gene_status_df, 
+                             by = c("Gene" = "gene"))
+caret::confusionMatrix(factor(scLANE_res_glm$Gene_Dynamic_Overall, levels = c(0, 1)), 
+                       factor(scLANE_res_glm$True_Gene_Status, levels = c(0, 1)), 
                        positive = "1")
 ```
 
@@ -147,22 +154,22 @@ caret::confusionMatrix(factor(de_res_glm$Gene_Dynamic_Overall, levels = c(0, 1))
 The function call is essentially the same when using the GLM backend, with the exception of needing to provide a sorted vector of subject IDs & a desired correlation structure. We also need to flip the `is.gee` flag in order to indicate that we'd like to fit estimating equations models (instead of mixed models). Since fitting GEEs is a fair bit more computationally complex than fitting GLMs, DE testing with the GEE backend takes a bit longer. Using more cores and / or running the tests on an HPC cluster (as this document is being compiled on a 2020 MacBook Pro) speeds things up considerably. Here we specify an [AR1 correlation structure](https://rdrr.io/cran/nlme/man/corAR1.html), which is the default for the GEE backend. 
 
 ```{r gee-models, results='hold'}
-de_test_gee <- testDynamic(sim_data, 
-                           pt = order_df, 
-                           genes = gene_sample[1:5], 
-                           size.factor.offset = cell_offset, 
-                           is.gee = TRUE, 
-                           id.vec = sim_data$subject, 
-                           cor.structure = "ar1", 
-                           n.cores = 2, 
-                           track.time = TRUE)
+scLANE_models_gee <- testDynamic(sim_data, 
+                                 pt = order_df, 
+                                 genes = gene_sample, 
+                                 size.factor.offset = cell_offset, 
+                                 is.gee = TRUE, 
+                                 id.vec = sim_data$subject, 
+                                 cor.structure = "ar1", 
+                                 n.cores = 2, 
+                                 track.time = TRUE)
 ```
 
 We again generate the table of DE test results. The variance of the estimated coefficients is determined using [the sandwich estimator](https://online.stat.psu.edu/stat504/lesson/12/12.3), and a Wald test is used to compare the null & alternate models. 
 
 ```{r gee-results}
-de_res_gee <- getResultsDE(de_test_gee)
-select(de_res_gee, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
+scLANE_res_gee <- getResultsDE(scLANE_models_gee)
+select(scLANE_res_gee, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
   slice_head(n = 5) %>% 
   knitr::kable("pipe", 
                digits = 3, 
@@ -172,11 +179,11 @@ select(de_res_gee, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Over
 We create the same confusion matrix as before. Empirically speaking, when the underlying dynamics don't differ much between subjects, GEEs tend to be more conservative (and thus perform slightly worse) than GLMs. This is shown below, where the GEE backend has decent accuracy, but the false negative rate is higher than that of the GLM backend. 
 
 ```{r eval-gee}
-de_res_gee <- inner_join(de_res_gee, 
-                         gene_status_df, 
-                         by = c("Gene" = "gene"))
-caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1)), 
-                       factor(de_res_gee$True_Gene_Status, levels = c(0, 1)), 
+scLANE_res_gee <- inner_join(scLANE_res_gee, 
+                             gene_status_df, 
+                             by = c("Gene" = "gene"))
+caret::confusionMatrix(factor(scLANE_res_gee$Gene_Dynamic_Overall, levels = c(0, 1)), 
+                       factor(scLANE_res_gee$True_Gene_Status, levels = c(0, 1)), 
                        positive = "1")
 ```
 
@@ -185,23 +192,23 @@ caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1))
 We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you're most interested in how a trajectory differs between subjects e.g., if the subjects belong to groups like Treatment & Control, and you expect the Treatment group to experience a different progression through the biological process. Executing the function with the GLMM backend differs only in that we switch the `is.glmm` flag to `TRUE` and no longer need to specify a working correlation structure. **Note**: the GLMM backend is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly. 
 
 ```{r glmm-models}
-de_test_glmm <- testDynamic(sim_data, 
-                            pt = order_df, 
-                            genes = gene_sample[1:5], 
-                            size.factor.offset = cell_offset, 
-                            n.potential.basis.fns = 3, 
-                            is.glmm = TRUE, 
-                            glmm.adaptive = TRUE, 
-                            id.vec = sim_data$subject, 
-                            n.cores = 2, 
-                            track.time = TRUE)
+scLANE_models_glmm <- testDynamic(sim_data, 
+                                  pt = order_df, 
+                                  genes = gene_sample, 
+                                  size.factor.offset = cell_offset, 
+                                  n.potential.basis.fns = 3, 
+                                  is.glmm = TRUE, 
+                                  glmm.adaptive = TRUE, 
+                                  id.vec = sim_data$subject, 
+                                  n.cores = 2, 
+                                  track.time = TRUE)
 ```
 
 Like the GLM backend, the GLMM backend uses a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood estimation instead of [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in order to perform this test; the null model in this case is a negative binomial GLMM with a random intercept for each subject.  
 
 ```{r glmm-results}
-de_res_glmm <- getResultsDE(de_test_glmm)
-select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
+scLANE_res_glmm <- getResultsDE(scLANE_models_glmm)
+select(scLANE_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
   slice_head(n = 5) %>% 
   knitr::kable("pipe", 
                digits = 3, 
@@ -211,11 +218,11 @@ select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Ove
 The GLMM backend performs about as well as the GEE backend. Like with the GEE backend, it's more appropriate to use these more complex models if expression dynamics might differ between subjects, with the difference being that you should use the GEE backend if you're interested in population-level trends & the GLMM backend if you're interested in per-subject trends. Since the dynamics in our simulated data are strongly conserved across subjects, it follows that the simpler GLMs perform the best. 
 
 ```{r eval-glmm}
-de_res_glmm <- inner_join(de_res_glmm, 
-                          gene_status_df, 
-                          by = c("Gene" = "gene"))
-caret::confusionMatrix(factor(de_res_glmm$Gene_Dynamic_Overall, levels = c(0, 1)), 
-                       factor(de_res_glmm$True_Gene_Status, levels = c(0, 1)), 
+scLANE_res_glmm <- inner_join(scLANE_res_glmm, 
+                              gene_status_df, 
+                              by = c("Gene" = "gene"))
+caret::confusionMatrix(factor(scLANE_res_glmm$Gene_Dynamic_Overall, levels = c(0, 1)), 
+                       factor(scLANE_res_glmm$True_Gene_Status, levels = c(0, 1)), 
                        positive = "1")
 ```
 
@@ -226,7 +233,7 @@ caret::confusionMatrix(factor(de_res_glmm$Gene_Dynamic_Overall, levels = c(0, 1)
 We can use the `plotModels()` to visually compare different types of modeling backends. It takes as input the results from `testDynamic()`, as well as a few specifications for which models & lineages should be plotted. While more complex visualizations can be created from our model output, this function gives us a good first glance at which models fit the underlying trend the best. Here we show the output generated using the GLM backend, split by model type. The intercept-only model shows the null hypothesis against which the scLANE model is compared using the likelihood ratio test and the GLM displays the inadequacy of monotonic modeling architectures for nonlinear dynamics. A GAM shows essentially the same trend as the `scLANE` model, though the fitted trend from `scLANE` is smoother & of course more interpretable.   
 
 ```{r plot-models-glm}
-plotModels(de_test_glm, 
+plotModels(scLANE_models_glm, 
            gene = "JARID2", 
            pt = order_df, 
            expr.mat = sim_data, 
@@ -236,8 +243,8 @@ plotModels(de_test_glm,
 When plotting the models generated using the GLMM backend, we split by lineage & color the points by subject ID instead of by lineage. 
 
 ```{r plot-models-glmm, fig.width=9, fig.height=9}
-plotModels(de_test_glmm, 
-           gene = "JARID2", 
+plotModels(scLANE_models_glmm, 
+           gene = "WAPAL", 
            pt = order_df, 
            expr.mat = sim_data, 
            size.factor.offset = cell_offset, 
@@ -251,11 +258,11 @@ plotModels(de_test_glmm,
 After generating a suitable set of models, we can cluster the genes in a semi-supervised fashion using `clusterGenes()`. This function uses the Leiden algorithm (the default), hierarchical clustering, or *k*-means and selects the best set of clustering hyperparameters using the silhouette score. If desired PCA can be run prior to clustering, but the default is to cluster on the fitted values themselves. We then use the results as input to `plotClusteredGenes()`, which generates a table of fitted values per-gene, per-lineage over pseudotime along with the accompanying cluster labels. 
 
 ```{r cluster-genes}
-gene_clusters <- clusterGenes(de_test_glm, 
+gene_clusters <- clusterGenes(scLANE_models_glm, 
                               pt = order_df, 
                               size.factor.offset = cell_offset, 
                               clust.algo = "leiden")
-gene_clust_table <- plotClusteredGenes(de_test_glm, 
+gene_clust_table <- plotClusteredGenes(scLANE_models_glm, 
                                        gene.clusters = gene_clusters, 
                                        pt = order_df, 
                                        size.factor.offset = cell_offset, 
@@ -272,15 +279,13 @@ The results can then be plotted as desired using `ggplot2` or another visualizat
 ```{r plot-clust}
 ggplot(gene_clust_table, aes(x = PT, y = FITTED, color = CLUSTER, group = GENE)) + 
   facet_wrap(~paste0("Cluster ", CLUSTER)) + 
-  geom_line(alpha = 0.75) + 
+  geom_line(alpha = 0.75, show.legend = FALSE) + 
   scale_y_continuous(labels = scales::label_number(accuracy = 1)) + 
   scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) + 
   labs(x = "Pseudotime", 
        y = "Fitted Values", 
-       color = "Leiden\nCluster", 
-       title = "Unsupervised Clustering of Gene Patterns") + 
-  theme_classic(base_size = 14) + 
-  guides(color = guide_legend(override.aes = list(linewidth = 3, alpha = 1)))
+       title = "Leiden clustering of gene dynamics") + 
+  theme_scLANE()
 ```
 
 Checking the true frequency of dynamic genes in each cluster seems to somewhat confirm that hypothesis: