RcppML

Vignettes here: https://zdebruine.github.io/RcppML/articles/getting-started.html

RcppML is an R package for high-performance Non-negative Matrix Factorization (NMF), truncated SVD/PCA, and divisive clustering of large sparse and dense matrices. It supports six statistical distributions via IRLS, cross-validation for automatic rank selection, optional CUDA GPU acceleration, out-of-core streaming for datasets larger than memory, and a composable graph DSL for multi-modal and deep NMF.

Overview

RcppML decomposes a matrix A into lower-rank non-negative factors:

A ≈ W · diag(d) · H

where W (features × k) and H (k × samples) have columns/rows scaled to unit sum via a diagonal d, ensuring interpretable, scale-invariant factors.

Key capabilities:

Fast NMF with alternating least squares, coordinate descent or Cholesky NNLS, and OpenMP parallelism via the Eigen C++ library
Six statistical distributions — Gaussian (MSE), Generalized Poisson, Negative Binomial, Gamma, Inverse Gaussian, Tweedie — fitted via Iteratively Reweighted Least Squares (IRLS)
Cross-validation with speckled holdout masks for principled rank selection
GPU acceleration via CUDA (cuBLAS/cuSPARSE) with automatic fallback to CPU
Streaming NMF from StreamPress (.spz) files for datasets that exceed available memory
FactorNet graph API for composable multi-modal, deep, and branching NMF pipelines
Truncated SVD with five methods: deflation, Krylov, Lanczos, IRLBA, randomized
Divisive clustering via recursive rank-2 factorizations

Installation

# Stable release from CRAN
install.packages("RcppML")

# Development version from GitHub (requires Rcpp and RcppEigen)
devtools::install_github("zdebruine/RcppML")

GPU support (optional): Requires CUDA Toolkit ≥ 11.0 installed on the system. See vignette("gpu-acceleration") for build instructions.

Quick Start

library(RcppML)

# Load a built-in gene expression dataset
data(aml)  # 824 genomic regions × 135 samples (dense matrix)

# Run NMF with rank k = 6
model <- nmf(aml, k = 6, seed = 42)
model
#> nmf model
#>   k = 6
#>   w: 824 x 6
#>   d: 6
#>   h: 6 x 135

# Inspect factor loadings
head(model@w)
model@d
model@h[, 1:5]

# Reconstruction error
evaluate(model, aml)

# Project onto new data
H_new <- predict(model, aml[, 1:50])

# Plot the model summary
plot(model)

Statistical Distributions

RcppML goes beyond standard Frobenius-norm NMF by supporting distribution-appropriate loss functions via IRLS. This is critical for count data (scRNA-seq, text mining) where Gaussian assumptions are wrong.

`loss=`	Distribution	Variance V(μ)	Use Case
`"mse"`	Gaussian	constant	Dense/general data (default)
`"gp"`	Generalized Poisson	μ(1+θ)²	Overdispersed counts
`"nb"`	Negative Binomial	μ + μ²/r	scRNA-seq, quadratic variance-mean
`"gamma"`	Gamma	μ²	Positive continuous data
`"inverse_gaussian"`	Inverse Gaussian	μ³	Heavy right-tailed positive
`"tweedie"`	Tweedie	μ^p	Flexible power-law variance

Example: Negative Binomial NMF for scRNA-seq

data(aml)

# NB is the natural choice for count data
model_nb <- nmf(aml, k = 6, loss = "nb")

# Fine-grained control via ... parameters
model_nb <- nmf(aml, k = 6,
                loss = "nb",
                nb_size_init = 10,     # Initial size parameter r
                nb_size_max = 1e6)     # Upper bound for r

# Automatic distribution selection via BIC
auto <- auto_nmf_distribution(aml, k = 6)
auto$best       # best distribution name
auto$comparison  # BIC/AIC table for each distribution

Zero-Inflation Models

For data with excess zeros (e.g., droplet-based scRNA-seq), zero-inflated GP and NB models estimate structural dropout probabilities separately from the count model:

# Zero-inflated Negative Binomial with per-row dropout
model_zinb <- nmf(aml, k = 6, loss = "nb", zi = "row")

# Per-column dropout estimation
model <- nmf(aml, k = 6, loss = "nb", zi = "col")

Cross-Validation

Cross-validation uses a speckled holdout mask — a random subset of matrix entries are held out during training, and test error is computed on these entries. This enables principled rank selection.

data(aml)

# Sweep k = 2 through 20, three replicates per rank
cv <- nmf(aml, k = 2:20, test_fraction = 0.1, cv_seed = 1:3)

# Visualize test error vs rank
plot(cv)

# Automatic rank selection (binary search)
model_auto <- nmf(aml, k = "auto")
model_auto@misc$rank_search$k_optimal

Cross-Validation Options

cv <- nmf(data, k = 2:15,
          test_fraction = 0.1,      # 10% holdout
          mask = "zeros",           # Only non-zero entries in test set (for sparse data)
          patience = 5,             # Early stopping patience
          cv_seed = c(42, 123, 7))  # Multiple seeds for replicates

Regularization

RcppML supports a comprehensive suite of regularization methods that can be combined freely:

L1 (LASSO) — Sparsity

# Symmetric L1 on both W and H
model <- nmf(data, k = 10, L1 = 0.1)

# Asymmetric: sparse H (embedding), dense W (loadings)
model <- nmf(data, k = 10, L1 = c(0, 0.2))

L2 (Ridge) — Shrinkage

model <- nmf(data, k = 10, L2 = c(0.01, 0.01))

L21 (Group Sparsity) — Factor Selection

L21-norm regularization drives entire rows of W or columns of H toward zero, effectively performing automatic factor selection:

model <- nmf(data, k = 20, L21 = c(0.1, 0))
# Some factors in W will be entirely zeroed out

Angular — Decorrelation

Encourages orthogonality between factors:

model <- nmf(data, k = 10, angular = c(0.1, 0.1))

Graph Laplacian — Spatial/Network Smoothness

If features or samples have a known graph structure (e.g., gene regulatory network, spatial coordinates), graph regularization encourages connected nodes to have similar factor representations:

# gene_graph: m × m sparse adjacency matrix
model <- nmf(data, k = 10, graph_W = gene_graph, graph_lambda = 0.5)

# Both feature and sample graphs
model <- nmf(data, k = 10,
             graph_W = gene_graph, graph_H = cell_graph,
             graph_lambda = c(0.5, 0.3))

Upper Bound Constraints

# Box constraint: all entries in W and H between 0 and 1
model <- nmf(data, k = 10, upper_bound = c(1, 1))

Robust NMF (Huber Loss)

# Huber-type robustness (less sensitive to outliers)
model <- nmf(data, k = 10, robust = TRUE)

# Custom Huber delta
model <- nmf(data, k = 10, robust = 2.0)

# MAE (L1 loss)
model <- nmf(data, k = 10, robust = "mae")

GPU Acceleration

RcppML optionally uses CUDA for GPU-accelerated NMF, delivering 10–20× speedups on large matrices.

# Check GPU availability
gpu_available()
#> [1] TRUE

gpu_info()
#> $device_name: "NVIDIA A100"
#> $total_memory_mb: 40960

# Sparse NMF on GPU
model <- nmf(data, k = 20, resource = "gpu")

# Dense NMF on GPU
model <- nmf(as.matrix(data), k = 20, resource = "gpu")

# Auto-dispatch (default): uses GPU if available, falls back to CPU
model <- nmf(data, k = 20, resource = "auto")

The GPU backend supports:

Standard NMF (sparse and dense)
Cross-validation NMF
MSE loss (GPU-native), all other losses via automatic CPU fallback
OpenMP + CUDA hybrid execution
Zero-copy NMF with st_read_gpu() for pre-loaded GPU data

Streaming Large Data (StreamPress `.spz` Files)

For datasets that exceed available RAM, RcppML streams data from StreamPress (.spz) compressed files — a column-oriented binary format with 10–20× compression via rANS entropy coding.

Writing and Reading SPZ Files

library(RcppML)
data(pbmc3k)

# Write sparse matrix to .spz file
spz_path <- tempfile(fileext = ".spz")
st_write(pbmc3k, spz_path, include_transpose = TRUE)

# Read back into memory
mat <- st_read(spz_path)
identical(dim(mat), dim(pbmc3k))  # TRUE

# File size comparison
file.size(spz_path)  # Much smaller than RDS/RDA

Streaming NMF

# NMF directly from .spz file — data never fully loaded into memory
model <- nmf(spz_path, k = 10)

# Streaming cross-validation
cv <- nmf(spz_path, k = 2:15, test_fraction = 0.1)

# Streaming with non-MSE distributions
model <- nmf(spz_path, k = 10, loss = "nb")

Streaming NMF processes data in column panels with double-buffered asynchronous I/O, maintaining O(m·k + n·k + chunk) memory regardless of total matrix size.

FactorNet Graph API

The factor_net() API composes complex factorization pipelines from simple building blocks. Multi-modal, deep, and branching NMF networks are expressed as directed graphs:

Multi-Modal NMF

Share a common embedding H across two data matrices (e.g., RNA + ATAC from the same cells):

# Define inputs
inp_rna  <- factor_input(rna_matrix, "rna")
inp_atac <- factor_input(atac_matrix, "atac")

# Create shared input node
shared <- factor_shared(inp_rna, inp_atac)

# Build NMF layer with per-factor regularization
layer <- shared |> nmf_layer(k = 10, name = "shared",
  W = W(L1 = 0.1),
  H = H(L1 = 0.05))

# Compile and fit the network
net <- factor_net(
  inputs = list(inp_rna, inp_atac),
  output = layer,
  config = factor_config(maxit = 100, seed = 42))

result <- fit(net)

# Access results (layer name = "shared")
result$shared$W$rna   # RNA feature loadings
result$shared$W$atac  # ATAC feature loadings
result$shared$H       # Shared cell embedding (k × n_cells)

Deep NMF

Stack layers for hierarchical decomposition:

inp <- factor_input(data, "X")

# Encoder layer: reduce to 20 factors
enc <- inp |> nmf_layer(k = 20, name = "encoder")

# Bottleneck: compress to 5 factors
bot <- enc |> nmf_layer(k = 5, name = "bottleneck")

net <- factor_net(inputs = list(inp), output = bot,
                  config = factor_config(maxit = 100, seed = 42))
result <- fit(net)

Conditional Factorization

Append conditioning metadata to a layer's H before passing downstream:

inp <- factor_input(data, "X")
layer1 <- inp |> nmf_layer(k = 5)

# Z is a conditioning matrix (n × p)
Z <- matrix(rep(c(1, 0, 0, 1), c(50, 50, 50, 50)), nrow = 100, ncol = 2)
conditioned <- factor_condition(layer1, Z)

layer2 <- conditioned |> nmf_layer(k = 10, name = "output")

net <- factor_net(inputs = list(inp), output = layer2,
                  config = factor_config(seed = 42))
result <- fit(net)

Truncated SVD / PCA

RcppML provides five SVD algorithms with optional constraints and cross-validation:

# Standard truncated SVD
result <- svd(data, k = 10)

# PCA (centered SVD)
result <- pca(data, k = 10)

# Non-negative PCA
result <- pca(data, k = 10, nonneg = TRUE)

# Sparse PCA with L1 penalty
result <- pca(data, k = 10, L1 = c(0, 0.1))

# Auto-rank selection
result <- svd(data, k = "auto")

# Method selection
result <- svd(data, k = 10, method = "lanczos")   # Fast unconstrained
result <- svd(data, k = 10, method = "krylov")     # Block method, all constraints
result <- svd(data, k = 10, method = "randomized") # Approximate, very fast

Method	Constraints	Speed	Best For
`deflation`	All	Moderate	Mixed constraints, small k
`krylov`	All	Fast	Large k with constraints
`lanczos`	None	Very fast	Unconstrained SVD
`irlba`	None	Fast	General unconstrained
`randomized`	None	Very fast	Approximate large-scale

Divisive Clustering

RcppML implements spectral clustering via recursive rank-2 NMF:

# Single bipartition
bp <- bipartition(data)
bp$samples  # Cluster assignments (0/1)

# Recursive divisive clustering
clusters <- dclust(data, min_samples = 50, min_dist = 0.05)
clusters$id  # Cluster labels

# Consensus clustering (multiple runs for stability)
cons <- consensus_nmf(data, k = 5, n_runs = 20)
plot(cons)

Non-Negative Least Squares (NNLS)

Project factor matrices onto new data:

# Given W from NMF, solve for H on new data
H_new <- nnls(w = model@w, A = new_data)

# Solve for W given H (transpose projection)
W_new <- nnls(h = model@h, A = new_data)

# Unconstrained LS (semi-NMF projection)
H_ls <- nnls(w = model@w, A = new_data, nonneg = c(TRUE, FALSE))

# With regularization
H_sparse <- nnls(w = model@w, A = new_data, L1 = c(0, 0.1))

Semi-NMF

Allow negative values in W (unconstrained) while keeping H non-negative:

model <- nmf(data, k = 10, nonneg = c(FALSE, TRUE))
any(model@w < 0)  # TRUE — W can have negative entries
all(model@h >= 0)  # TRUE — H remains non-negative

Key Parameters Reference

Parameter	Type	Default	Description
`k`	int/vector/"auto"	—	Rank (vector for CV sweep, "auto" for search)
`loss`	string	`"mse"`	Loss function: mse, gp, nb, gamma, inverse_gaussian, tweedie
`L1`	numeric(2)	`c(0,0)`	LASSO penalty [W, H]
`L2`	numeric(2)	`c(0,0)`	Ridge penalty [W, H]
`L21`	numeric(2)	`c(0,0)`	Group sparsity [W, H]
`angular`	numeric(2)	`c(0,0)`	Decorrelation penalty [W, H]
`nonneg`	logical(2)	`c(TRUE,TRUE)`	Non-negativity [W, H]
`upper_bound`	numeric(2)	`c(0,0)`	Box constraint [W, H] (0 = none)
`zi`	string	`"none"`	Zero-inflation: none, row, col
`robust`	logical/numeric	`FALSE`	Huber robustness (TRUE = δ=1.345, numeric = custom δ)
`solver`	string	`"auto"`	NNLS solver: auto, cd, cholesky
`seed`	int/matrix/string	`NULL`	Initialization: NULL (random), integer (reproducible), "lanczos", "irlba"
`mask`	various	`NULL`	Missing data: NULL, "zeros", "NA", or a mask matrix
`resource`	string	`"auto"`	Compute: auto, cpu, gpu
`test_fraction`	numeric	`0`	CV holdout fraction (0 = no CV)
`tol`	numeric	`1e-4`	Convergence tolerance
`maxit`	int	`100`	Maximum iterations
`threads`	int	`0`	OpenMP threads (0 = all)
`verbose`	logical	`FALSE`	Print progress

Built-in Datasets

Dataset	Description	Dimensions	Format
`pbmc3k`	PBMC single-cell RNA-seq (10x Genomics)	13,714 × 2,638	SPZ raw bytes
`aml`	AML leukemia ATAC-seq	824 × 135	Dense matrix
`golub`	Golub leukemia microarray	38 × 5,000	Sparse dgCMatrix
`movielens`	MovieLens ratings	3,867 × 610	Sparse dgCMatrix
`hawaiibirds`	Hawaii bird survey counts	183 × 1,183	Sparse dgCMatrix
`olivetti`	Olivetti face images	400 × 4,096	Sparse dgCMatrix
`digits`	Handwritten digits (MNIST subset)	1,797 × 64	Dense matrix

Performance Tips

Sparse input: Use dgCMatrix format (from the Matrix package) for sparse data — RcppML auto-detects and uses optimized sparse routines.
Solver selection: For large k (> 32), solver = "cholesky" can be faster than coordinate descent. Use solver = "auto" (default) for automatic selection.
Initialization: seed = "lanczos" provides better starting points and can reduce iteration count by 30–50%, but adds upfront SVD cost.
Threads: RcppML uses OpenMP. Set options(RcppML.threads = 4) to control parallelism, or threads = 0 for all available cores.
GPU: For matrices with > 10K rows and k > 8, GPU acceleration provides significant speedup. Use resource = "gpu".
Streaming: For datasets larger than available RAM, write to .spz format with st_write() and factorize directly from the file path.

Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, requesting features, and submitting pull requests.

Citation

citation("RcppML")

DeBruine ZJ, Melcher K, Triche TJ (2021). "High-performance non-negative
matrix factorization for large single-cell data." BioRXiv.
doi:10.1101/2021.09.01.458620.

License

GPL (≥ 3)

Name		Name	Last commit message	Last commit date
Latest commit History 573 Commits
.github		.github
R		R
data		data
docs		docs
inst		inst
man		man
src		src
tests		tests
tools		tools
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
CRAN_AUDIT.md		CRAN_AUDIT.md
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
RcppML.Rproj		RcppML.Rproj
_pkgdown.yml		_pkgdown.yml
cleanup		cleanup
configure		configure
cran-comments.md		cran-comments.md

Folders and files

Latest commit

History

Repository files navigation

RcppML

Overview

Installation

Quick Start

Statistical Distributions

Example: Negative Binomial NMF for scRNA-seq

Zero-Inflation Models

Cross-Validation

Cross-Validation Options

Regularization

L1 (LASSO) — Sparsity

L2 (Ridge) — Shrinkage

L21 (Group Sparsity) — Factor Selection

Angular — Decorrelation

Graph Laplacian — Spatial/Network Smoothness

Upper Bound Constraints

Robust NMF (Huber Loss)

GPU Acceleration

Streaming Large Data (StreamPress .spz Files)

Writing and Reading SPZ Files

Streaming NMF

FactorNet Graph API

Multi-Modal NMF

Deep NMF

Conditional Factorization

Truncated SVD / PCA

Divisive Clustering

Non-Negative Least Squares (NNLS)

Semi-NMF

Key Parameters Reference

Built-in Datasets

Performance Tips

Contributing

Citation

License

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Streaming Large Data (StreamPress `.spz` Files)

Packages