scDECAF is a statistical learning algorithm for single-cell RNA-seq analysis. It identifies gene signatures, states, and transcriptional programs by learning vector representations of gene sets.
Through sparse selection, scDECAF highlights the most biologically relevant programs, improving interpretability of single-cell data.
Requires R ≥ 4.0.0.
install.packages("devtools")
devtools::install_github("DavisLaboratory/scDECAF")Here is a minimal runnable example with toy data:
library(scDECAF)
# Simulated expression matrix (100 genes x 200 cells)
set.seed(123)
x <- matrix(rpois(100*200, lambda = 5), nrow = 100, ncol = 200)
rownames(x) <- paste0("Gene", 1:100)
colnames(x) <- paste0("Cell", 1:200)
# Define toy gene sets
genesetlist <- list(
Pathway_A = rownames(x)[1:15],
Pathway_B = rownames(x)[16:30],
Pathway_C = rownames(x)[31:45]
)
# Highly variable genes (here, all genes for simplicity)
hvg <- rownames(x)
# Dummy 2D embedding (e.g., from PCA/UMAP)
cell_embedding <- matrix(rnorm(200*2), ncol = 2)
rownames(cell_embedding) <- colnames(x)
# Sparse selection of relevant gene sets
selected_gs <- pruneGenesets(
data = x,
genesetlist = genesetlist,
hvg = hvg,
embedding = cell_embedding,
min_gs_size = 5,
lambda = exp(-3)
)
# Build gene–set assignment matrix
target <- genesets2ids(
x[match(hvg, rownames(x)), ],
genesetlist[selected_gs]
)
# Compute gene-set scores
ann_res <- scDECAF(
data = x,
gs = target,
hvg = hvg,
k = 5,
embedding = cell_embedding,
n_components = min(2, ncol(target) - 1),
max_iter = 2,
thresh = 0.5
)
# Extract per-cell scores
scores <- attributes(ann_res)$raw_scores
head(scores[, 1:3]) # preview first few componentsYou can now add scores to your single-cell object (SingleCellExperiment, Seurat, or AnnData) and visualize them per cell.
- data: log-normalised single-cell expression matrix
- genesetlist: named list of gene sets
- hvg: highly variable genes
- embedding: reduced dimension embedding (UMAP, PCA, PHATE, etc.)
- min_gs_size: minimum gene set size
- lambda: shrinkage penalty
- n_components: number of CCA components
- k: nearest neighbors for refinement
| Workflow | Notebook |
|---|---|
| Pathway/gene signature screening and scoring | Kang et al. (2018): 25K PBMC single cells |
| Optimization of sparsity operator | Experimentation with the shrinkage penalty and gene set screening results in Kang et al. (2018) |
| PMBC COVID-19 analysis | Combining reference atlas mapping and Milo analysis with scDECAF gene set screening |
| Drug2cell analysis with scDECAF | [Running scDECAF with pre-computed Drug2cell scores in HECOA Organoid Atlas] |
Full analysis notebooks reproducing the manuscript are available in the reproducibility repository.
If you use scDECAF, please cite:
Hediyehzadeh, Whitfield, et al., Identification of cell types, states and programs by learning gene set representations, bioRxiv (2023). https://doi.org/10.1101/2023.09.08.556842
This project is released under the same license as the Davis Laboratory repositories.