diff --git a/DESCRIPTION b/DESCRIPTION index 55d7f2c..b5d650c 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Package: scLANE Type: Package Title: Model gene expression dynamics with spline-based NB GLMs, GEEs, & GLMMs -Version: 0.7.0 +Version: 0.7.1 Authors@R: c(person(given = "Jack", family = "Leary", email = "j.leary@ufl.edu", role = c("aut", "cre")), person(given = "Rhonda", family = "Bacher", email = "rbacher@ufl.edu", role = c("ctb", "fnd"))) Description: This package uses truncated power basis spline models to build flexible, interpretable models of single cell gene expression over pseudotime or latent time. diff --git a/NEWS.md b/NEWS.md index d14d7c3..d3f656f 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,8 @@ +# scLANE 0.7.1 + +* Changed input format of all functions to allow counts matrices formatted as `SingleCellExperiment` or `Seurat` objects, sparse matrices, or dense matrices. +* Updated visualization functions to reflect changes made in `ggplot2` v3.4 (mostly changing the `size` parameter in line-based geoms to be `linewidth` instead). + # scLANE 0.6.3 * Added a `NEWS.md` file to track changes to the package. diff --git a/README.Rmd b/README.Rmd index 109afda..857fca4 100644 --- a/README.Rmd +++ b/README.Rmd @@ -95,13 +95,12 @@ plotPCA(sim_data, colour_by = "cell_time_normed") ## Trajectory DE Testing -Since we have multi-subject data, we can use any of the three model backends to run our DE testing. We'll start with the simplest model, the GLM, then work our way through the other options in order of increasing complexity. We first prepare our inputs - a dense matrix with cells as rows & genes as columns (i.e., transposed from the way it's stored in `SingleCellExperiment` & `Seurat` objects), a dataframe containing our cell ordering, a set of genes to build models for, and a vector of per-cell size factors to be used as offsets during estimation. In reality, it's usually unnecessary to fit a model for every single gene in a dataset, as trajectories are usually estimated using a subset of the entire set of genes (usually a few thousand most highly variable genes). For the purpose of demonstration, we'll select 50 genes each from the dynamic and non-dynamic populations. **Note**: in this case we're working with a single pseudotime lineage, though in real datasets several lineages often exist; in order to fit models for a subset of lineages simply remove the corresponding columns from the cell ordering dataframe passed as input to `testDynamic()`. +Since we have multi-subject data, we can use any of the three model backends to run our DE testing. We'll start with the simplest model, the GLM, then work our way through the other options in order of increasing complexity. We first prepare our inputs - a dataframe containing our cell ordering, a set of genes to build models for, and a vector of per-cell size factors to be used as offsets during estimation. In reality, it's usually unnecessary to fit a model for every single gene in a dataset, as trajectories are usually estimated using a subset of the entire set of genes (usually a few thousand most highly variable genes). For the purpose of demonstration, we'll select 50 genes each from the dynamic and non-dynamic populations. **Note**: in this case we're working with a single pseudotime lineage, though in real datasets several lineages often exist; in order to fit models for a subset of lineages simply remove the corresponding columns from the cell ordering dataframe passed as input to `testDynamic()`. ```{r prep-data} set.seed(312) gene_sample <- c(sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "Dynamic"], size = 50), sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "NotDynamic"], size = 50)) -counts_mat <- as.matrix(t(counts(sim_data)[gene_sample, ])) order_df <- data.frame(X = sim_data$cell_time_normed) cell_offset <- createCellOffset(sim_data) ``` @@ -111,11 +110,10 @@ cell_offset <- createCellOffset(sim_data) Running `testDynamic()` provides us with a nested list containing model output & DE test results for each gene over each pseudotime / latent time lineage. In this case, since we have a true cell ordering we only have one lineage. Parallel processing is turned on by default, and we use 2 cores here to speed up runtime. ```{r glm-models, results='hold'} -de_test_glm <- testDynamic(expr.mat = counts_mat, +de_test_glm <- testDynamic(sim_data, pt = order_df, genes = gene_sample, size.factor.offset = cell_offset, - n.potential.basis.fns = 5, n.cores = 2, track.time = TRUE) ``` @@ -149,11 +147,10 @@ caret::confusionMatrix(factor(de_res_glm$Gene_Dynamic_Overall, levels = c(0, 1)) The function call is essentially the same when using the GLM backend, with the exception of needing to provide a sorted vector of subject IDs & a desired correlation structure. We also need to flip the `is.gee` flag in order to indicate that we'd like to fit estimating equations models (instead of mixed models). Since fitting GEEs is a fair bit more computationally complex than fitting GLMs, DE testing with the GEE backend takes a bit longer. Using more cores and / or running the tests on an HPC cluster (as this document is being compiled on a 2020 MacBook Pro) speeds things up considerably. Here we specify an [AR1 correlation structure](https://rdrr.io/cran/nlme/man/corAR1.html), which is the default for the GEE backend. ```{r gee-models, results='hold'} -de_test_gee <- testDynamic(expr.mat = counts_mat, +de_test_gee <- testDynamic(sim_data, pt = order_df, genes = gene_sample, size.factor.offset = cell_offset, - n.potential.basis.fns = 5, is.gee = TRUE, id.vec = sim_data$subject, cor.structure = "ar1", @@ -185,10 +182,10 @@ caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1)) ### GLMM Backend -We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you're most interested in how a trajectory differs between subjects e.g., if the subjects can be stratified by groups such as Treatment & Control and you expect the Treatment group to have a different progression through the biological process. Executing the function with the GLMM backend differs only in that we switch the `is.glmm` flag to `TRUE` and no longer need to specify a working correlation structure. **Note**: the GLMM backend is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly. +We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you're most interested in how a trajectory differs between subjects e.g., if the subjects belong to groups like Treatment & Control, and you expect the Treatment group to experience a different progression through the biological process. Executing the function with the GLMM backend differs only in that we switch the `is.glmm` flag to `TRUE` and no longer need to specify a working correlation structure. **Note**: the GLMM backend is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly. ```{r glmm-models} -de_test_glmm <- testDynamic(expr.mat = counts_mat, +de_test_glmm <- testDynamic(sim_data, pt = order_df, genes = gene_sample, size.factor.offset = cell_offset, @@ -200,7 +197,7 @@ de_test_glmm <- testDynamic(expr.mat = counts_mat, track.time = TRUE) ``` -Like the GLM backend, the GLMM backend use a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood estimation instead of [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in order to perform this test; the null model in this case is a negative binomial GLMM with a random intercept for each subject. +Like the GLM backend, the GLMM backend uses a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood estimation instead of [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in order to perform this test; the null model in this case is a negative binomial GLMM with a random intercept for each subject. ```{r glmm-results} de_res_glmm <- getResultsDE(de_test_glmm) @@ -211,7 +208,7 @@ select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Ove col.names = c("Gene", "Lineage", "LRT Stat.", "P-value", "Adj. P-value", "Predicted Dynamic Status")) ``` -The GLMM backend performs about as well as the GEE backend. Like with the GEE backend, it's more appropriate to use these more complex models when you expect expression dynamics to differ between subjects, with the difference being that you should use the GEE backend if you're interested in population-level trends & the GLMM backend if you're interested in per-subject trends. Since the dynamics in our simulated data are strongly conserved across subjects, it follows that the simpler GLMs perform the best. +The GLMM backend performs about as well as the GEE backend. Like with the GEE backend, it's more appropriate to use these more complex models if expression dynamics might differ between subjects, with the difference being that you should use the GEE backend if you're interested in population-level trends & the GLMM backend if you're interested in per-subject trends. Since the dynamics in our simulated data are strongly conserved across subjects, it follows that the simpler GLMs perform the best. ```{r eval-glmm} de_res_glmm <- inner_join(de_res_glmm, @@ -232,7 +229,7 @@ We can use the `plotModels()` to visually compare different types of modeling ba plotModels(de_test_glm, gene = "JARID2", pt = order_df, - expr.mat = counts_mat, + expr.mat = sim_data, size.factor.offset = cell_offset) ``` @@ -242,7 +239,7 @@ When plotting the models generated using the GLMM backend, we split by lineage & plotModels(de_test_glmm, gene = "WAPAL", pt = order_df, - expr.mat = counts_mat, + expr.mat = sim_data, size.factor.offset = cell_offset, is.glmm = TRUE, plot.null = FALSE, @@ -283,7 +280,7 @@ ggplot(gene_clust_table, aes(x = PT, y = FITTED, color = CLUSTER, group = GENE)) color = "Leiden\nCluster", title = "Unsupervised Clustering of Gene Patterns") + theme_classic(base_size = 14) + - guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) + guides(color = guide_legend(override.aes = list(linewidth = 3, alpha = 1))) ``` Checking the true frequency of dynamic genes in each cluster seems to somewhat confirm that hypothesis: diff --git a/README.md b/README.md index f256b8d..52ab26c 100644 --- a/README.md +++ b/README.md @@ -160,9 +160,7 @@ plotPCA(sim_data, colour_by = "cell_time_normed") Since we have multi-subject data, we can use any of the three model backends to run our DE testing. We’ll start with the simplest model, the GLM, then work our way through the other options in order of increasing -complexity. We first prepare our inputs - a dense matrix with cells as -rows & genes as columns (i.e., transposed from the way it’s stored in -`SingleCellExperiment` & `Seurat` objects), a dataframe containing our +complexity. We first prepare our inputs - a dataframe containing our cell ordering, a set of genes to build models for, and a vector of per-cell size factors to be used as offsets during estimation. In reality, it’s usually unnecessary to fit a model for every single gene @@ -179,7 +177,6 @@ dataframe passed as input to `testDynamic()`. set.seed(312) gene_sample <- c(sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "Dynamic"], size = 50), sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "NotDynamic"], size = 50)) -counts_mat <- as.matrix(t(counts(sim_data)[gene_sample, ])) order_df <- data.frame(X = sim_data$cell_time_normed) cell_offset <- createCellOffset(sim_data) ``` @@ -193,14 +190,13 @@ have one lineage. Parallel processing is turned on by default, and we use 2 cores here to speed up runtime. ``` r -de_test_glm <- testDynamic(expr.mat = counts_mat, +de_test_glm <- testDynamic(sim_data, pt = order_df, genes = gene_sample, size.factor.offset = cell_offset, - n.potential.basis.fns = 5, n.cores = 2, track.time = TRUE) -#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 26.588 secs" +#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 26.572 secs" ``` After the function finishes running, we use `getResultsDE()` to generate @@ -286,17 +282,16 @@ structure](https://rdrr.io/cran/nlme/man/corAR1.html), which is the default for the GEE backend. ``` r -de_test_gee <- testDynamic(expr.mat = counts_mat, +de_test_gee <- testDynamic(sim_data, pt = order_df, genes = gene_sample, size.factor.offset = cell_offset, - n.potential.basis.fns = 5, is.gee = TRUE, id.vec = sim_data$subject, cor.structure = "ar1", n.cores = 2, track.time = TRUE) -#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 3.849 mins" +#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 3.129 mins" ``` We again generate the table of DE test results. The variance of the @@ -368,19 +363,19 @@ caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1)) We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you’re most interested in how a -trajectory differs between subjects e.g., if the subjects can be -stratified by groups such as Treatment & Control and you expect the -Treatment group to have a different progression through the biological -process. Executing the function with the GLMM backend differs only in -that we switch the `is.glmm` flag to `TRUE` and no longer need to -specify a working correlation structure. **Note**: the GLMM backend is -still under development, as we are working on further reducing runtime -and increasing the odds of the underlying optimization process -converging successfully. As such, updates will be frequent and -functionality / results may shift slightly. +trajectory differs between subjects e.g., if the subjects belong to +groups like Treatment & Control, and you expect the Treatment group to +experience a different progression through the biological process. +Executing the function with the GLMM backend differs only in that we +switch the `is.glmm` flag to `TRUE` and no longer need to specify a +working correlation structure. **Note**: the GLMM backend is still under +development, as we are working on further reducing runtime and +increasing the odds of the underlying optimization process converging +successfully. As such, updates will be frequent and functionality / +results may shift slightly. ``` r -de_test_glmm <- testDynamic(expr.mat = counts_mat, +de_test_glmm <- testDynamic(sim_data, pt = order_df, genes = gene_sample, size.factor.offset = cell_offset, @@ -390,10 +385,10 @@ de_test_glmm <- testDynamic(expr.mat = counts_mat, id.vec = sim_data$subject, n.cores = 2, track.time = TRUE) -#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 5.306 mins" +#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 3.603 mins" ``` -Like the GLM backend, the GLMM backend use a likelihood ratio test to +Like the GLM backend, the GLMM backend uses a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood estimation instead of [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in @@ -419,7 +414,7 @@ select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Ove The GLMM backend performs about as well as the GEE backend. Like with the GEE backend, it’s more appropriate to use these more complex models -when you expect expression dynamics to differ between subjects, with the +if expression dynamics might differ between subjects, with the difference being that you should use the GEE backend if you’re interested in population-level trends & the GLMM backend if you’re interested in per-subject trends. Since the dynamics in our simulated @@ -483,7 +478,7 @@ the same trend as the `scLANE` model, though the fitted trend from plotModels(de_test_glm, gene = "JARID2", pt = order_df, - expr.mat = counts_mat, + expr.mat = sim_data, size.factor.offset = cell_offset) ``` @@ -496,7 +491,7 @@ lineage & color the points by subject ID instead of by lineage. plotModels(de_test_glmm, gene = "WAPAL", pt = order_df, - expr.mat = counts_mat, + expr.mat = sim_data, size.factor.offset = cell_offset, is.glmm = TRUE, plot.null = FALSE, @@ -557,7 +552,7 @@ ggplot(gene_clust_table, aes(x = PT, y = FITTED, color = CLUSTER, group = GENE)) color = "Leiden\nCluster", title = "Unsupervised Clustering of Gene Patterns") + theme_classic(base_size = 14) + - guides(color = guide_legend(override.aes = list(size = 3, alpha = 1))) + guides(color = guide_legend(override.aes = list(linewidth = 3, alpha = 1))) ``` diff --git a/man/figures/README-plot-models-glm-1.png b/man/figures/README-plot-models-glm-1.png index 80b7a14..875fbc3 100644 Binary files a/man/figures/README-plot-models-glm-1.png and b/man/figures/README-plot-models-glm-1.png differ diff --git a/man/figures/README-plot-models-glmm-1.png b/man/figures/README-plot-models-glmm-1.png index ba19c76..7003772 100644 Binary files a/man/figures/README-plot-models-glmm-1.png and b/man/figures/README-plot-models-glmm-1.png differ