GitHub - BojarLab/GlycoForge: A simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection

GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.

Key Features

Two simulation modes: fully synthetic, or hybrid (extract biological effects from input reference data, then simulate batch effects)
Controllable effect injection: systematic grid search over biological-effect and batch-effect strength parameters
MNAR missing-data simulation: mimics left-censored patterns biased toward low-abundance glycans

Quick Start

Installation

Python >= 3.10 required.
Core dependency: glycowork>=1.6.4

pip install glycoforge

OR

git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .

Usage

See run_simulation.ipynb for interactive examples, or use_cases/batch_correction/ for batch correction workflows.

How the simulator works

We keep everything in the CLR (centered log-ratio) space:

First, draw a healthy baseline composition from a Dirichlet prior: p_H ~ Dirichlet(alpha_H).
Flip to CLR: z_H = clr(p_H).
For selected glycans, push the signal using real or synthetic effect sizes: z_U = z_H + m * lambda * d_robust, where m is the differential mask, lambda is bio_strength, and d_robust is the effect vector after robust_effect_size_processing.
- Simplified mode: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline.
- Hybrid mode: start from the Cohen’s d values returned by glycowork.get_differential_expression; define_differential_mask lets you restrict the injection to significant hits or top-N glycans before scaling.
Invert back to proportions: p_U = invclr(z_U), then scale by the Dirichlet concentration to get alpha_U. The healthy and unhealthy groups use different concentrations: alpha_H = k_dir * p_H and alpha_U = (k_dir / variance_ratio) * p_U, so variance_ratio controls their relative magnitude.
Batch effects ride on top as direction vectors u_b, so a clean CLR sample Y_clean becomes Y_with_batch = Y_clean + kappa_mu * u_b + epsilon, where kappa_mu sets the shift strength and var_b controls the spread of the noise term epsilon.
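The biological-signal steps above can be sketched with NumPy alone. This is a minimal illustration, not GlycoForge's implementation: the clr and invclr helpers are written out by hand, and the mask, effect vector, and parameter values are made-up numbers chosen only to show the arithmetic.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform of a composition (last axis sums to 1)."""
    log_p = np.log(p)
    return log_p - log_p.mean(axis=-1, keepdims=True)

def invclr(z):
    """Inverse CLR: a softmax that maps CLR coordinates back to proportions."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_glycans = 6

# Healthy baseline: p_H ~ Dirichlet(alpha_H), then move to CLR space.
alpha_H = np.full(n_glycans, 10.0)
p_H = rng.dirichlet(alpha_H)
z_H = clr(p_H)

# Illustrative inputs (assumptions, not values from the package).
m = np.array([1, 1, 0, 0, 0, 1], dtype=float)            # differential mask
d_robust = np.array([0.8, -0.5, 0.0, 0.0, 0.0, 1.2])     # processed effect sizes
lam = 1.5                                                # bio_strength

# Inject the effect in CLR space and map back to proportions.
z_U = z_H + m * lam * d_robust
p_U = invclr(z_U)

# Scale by the Dirichlet concentration, as in the hybrid-mode description.
k_dir, variance_ratio = 100.0, 2.0
alpha_U = (k_dir / variance_ratio) * p_U
```

Because the shift is applied in CLR space, glycans outside the mask keep their ratios to one another; only their absolute proportions change when the composition is renormalized.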

Simulation Modes

The pipeline entry point is glycoforge.simulate() with two modes controlled by data_source. Configuration files are in sample_config/.

Simplified mode (data_source="simulated") – Fully synthetic simulation

No real data dependency. Ideal for controlled experiments with known ground truth.

Pipeline steps:

Initializes uniform healthy baseline: alpha_H = ones(n_glycans) * 10
For each random seed, generates alpha_U by randomly scaling alpha_H:
- up_frac (default 30%) upregulated with scale factors from up_scale_range=(1.1, 3.0)
- down_frac (default 30%) downregulated with scale factors from down_scale_range=(0.3, 0.9)
- Remaining glycans (~40%) stay unchanged
Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
Defines batch effect direction vectors u_dict once per simulation run (a fixed seed ensures reproducible batch geometry across the parameter sweep)
Applies batch effects controlled by kappa_mu (shift strength) and var_b (variance scaling)
Optionally applies MNAR (Missing Not At Random) missingness:
- missing_fraction: proportion of missing values (0.0-1.0)
- mnar_bias: intensity-dependent bias (default 2.0, range 0.5-5.0)
- Left-censored pattern: low-abundance glycans more likely to be missing
Grid search over kappa_mu and var_b produces multiple datasets that share an identical batch-effect structure
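The baseline and cohort-sampling steps above can be sketched as follows. The function name make_alpha_U is hypothetical, and drawing the scale factors uniformly from each range is an assumption; the README only says the factors come from up_scale_range and down_scale_range.

```python
import numpy as np

def make_alpha_U(alpha_H, rng, up_frac=0.3, down_frac=0.3,
                 up_scale_range=(1.1, 3.0), down_scale_range=(0.3, 0.9)):
    """Hypothetical helper: scale random subsets of alpha_H up or down."""
    n = alpha_H.size
    n_up = int(round(up_frac * n))
    n_down = int(round(down_frac * n))
    order = rng.permutation(n)                      # random, non-overlapping subsets
    up_idx = order[:n_up]
    down_idx = order[n_up:n_up + n_down]
    alpha_U = alpha_H.astype(float).copy()
    # Assumption: scale factors are drawn uniformly within each range.
    alpha_U[up_idx] *= rng.uniform(*up_scale_range, size=n_up)
    alpha_U[down_idx] *= rng.uniform(*down_scale_range, size=n_down)
    return alpha_U, up_idx, down_idx

n_glycans, n_H, n_U = 20, 15, 15
rng = np.random.default_rng(42)

alpha_H = np.ones(n_glycans) * 10                   # uniform healthy baseline
alpha_U, up_idx, down_idx = make_alpha_U(alpha_H, rng)

# Clean cohorts: one row per sample, each row a composition summing to 1.
healthy = rng.dirichlet(alpha_H, size=n_H)
unhealthy = rng.dirichlet(alpha_U, size=n_U)
```

With 20 glycans and the default fractions, six glycans are scaled up, six are scaled down, and the remaining eight keep their baseline concentration of 10.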

Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, missing_fraction, mnar_bias
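To illustrate how missing_fraction and mnar_bias might interact, here is a rough sketch of left-censored missingness. The weighting function (an exponential of the standardized log abundance) and the helper name apply_mnar are assumptions made for this example; the README does not specify how GlycoForge computes the drop probabilities.

```python
import numpy as np

def apply_mnar(X, missing_fraction, mnar_bias, rng):
    """Hypothetical helper: set a fixed fraction of entries to NaN,
    preferring low-abundance values (left-censored pattern)."""
    X = np.asarray(X, dtype=float)
    n_missing = int(round(missing_fraction * X.size))
    if n_missing == 0:
        return X.copy()
    flat = X.ravel().copy()
    # Standardize log abundances so mnar_bias acts on a unitless scale.
    log_x = np.log(flat + 1e-12)
    z = (log_x - log_x.mean()) / (log_x.std() + 1e-12)
    # Assumed weighting: lower abundance -> larger weight; a larger
    # mnar_bias makes the preference for low values steeper.
    weights = np.exp(-mnar_bias * z)
    probs = weights / weights.sum()
    drop = rng.choice(flat.size, size=n_missing, replace=False, p=probs)
    flat[drop] = np.nan
    return flat.reshape(X.shape)

rng = np.random.default_rng(3)
# 50 samples x 10 glycans with deliberately uneven abundances.
X = rng.dirichlet(np.linspace(1, 20, 10), size=50)
X_missing = apply_mnar(X, missing_fraction=0.2, mnar_bias=2.0, rng=rng)
```

In this sketch missing_fraction fixes how many values are removed, while mnar_bias only changes which values are removed.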

Hybrid mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect

Starts from real glycomics data to preserve biological signal structure. Accepts a CSV file or a glycowork.glycan_data dataset.

Pipeline steps:

Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via column_prefix)
Runs CLR-based differential expression via glycowork.get_differential_expression to compute Cohen's d effect sizes
Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
Applies differential_mask to select which glycans receive biological signal injection:
- "All": inject into all glycans
- "significant": only glycans marked significant by glycowork
- "Top-N": top N glycans by absolute effect size (e.g., "Top-10")
Processes effect sizes through robust_effect_size_processing:
- Centers effect sizes to remove global shift
- Applies Winsorization to clip extreme outliers (auto-selects a percentile between 85 and 99 unless winsorize_percentile is set)
- Normalizes by baseline (baseline_method: median, MAD, or p75)
- Returns normalized d_robust scaled by bio_strength
Injects effects in CLR space: z_U = z_H + mask * bio_strength * d_robust
Converts back to proportions: p_U = invclr(z_U)
Scales by Dirichlet concentration: alpha_H = k_dir * p_H and alpha_U = (k_dir / variance_ratio) * p_U
Samples clean cohorts from Dirichlet(alpha_H) and Dirichlet(alpha_U) with n_H healthy and n_U unhealthy samples
Defines batch effect direction vectors u_dict once per run (fixed seed ensures fair comparison across parameter combinations)
Applies batch effects: y_batch = y_clean + kappa_mu * sigma * u_b + epsilon, where epsilon ~ N(0, sqrt(var_b) * sigma)
Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans)
Grid search over bio_strength, k_dir, variance_ratio, kappa_mu, var_b to systematically test biological signal and batch effect interactions
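The masking and robust-processing steps above can be approximated as below. This is a sketch under stated assumptions, not the code behind robust_effect_size_processing or define_differential_mask: both function names here are hypothetical, centering uses the median, Winsorization clips absolute values at a single percentile, and scaling by bio_strength is left to the injection step.

```python
import numpy as np

def robust_effects_sketch(d, winsorize_percentile=95.0, baseline_method="median"):
    """Hypothetical stand-in: center, Winsorize, and normalize effect sizes."""
    d = np.asarray(d, dtype=float)
    centered = d - np.median(d)                       # assumption: median centering
    cap = np.percentile(np.abs(centered), winsorize_percentile)
    clipped = np.clip(centered, -cap, cap)            # clip extreme outliers
    abs_vals = np.abs(clipped)
    if baseline_method == "median":
        baseline = np.median(abs_vals)
    elif baseline_method == "MAD":
        baseline = np.median(np.abs(clipped - np.median(clipped)))
    elif baseline_method == "p75":
        baseline = np.percentile(abs_vals, 75)
    else:
        raise ValueError(f"unknown baseline_method: {baseline_method}")
    return clipped / baseline if baseline > 0 else clipped

def top_n_mask(d, n):
    """Hypothetical stand-in for a "Top-N" mask: 1.0 for the n largest |d|."""
    d = np.asarray(d, dtype=float)
    mask = np.zeros(d.size)
    mask[np.argsort(-np.abs(d))[:n]] = 1.0
    return mask

# Made-up Cohen's d values with one extreme outlier at index 3.
d = np.array([0.2, -0.4, 0.1, 6.0, -0.3, 0.5, 0.0, -0.1])
d_robust = robust_effects_sketch(d, winsorize_percentile=90.0)
mask = top_n_mask(d, 3)

bio_strength = 1.0
shift = mask * bio_strength * d_robust                # term added to z_H in CLR space
```

The outlier still has the largest processed effect, but clipping keeps it from dominating the injected shift by the full factor of its raw value.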

Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias
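The batch-effect and grid-search steps can be sketched in the same spirit. Several details here are assumptions for illustration: sigma is taken as the per-glycan standard deviation of the clean CLR data, sqrt(var_b) * sigma is read as the standard deviation of epsilon, direction vectors are random unit vectors centered to sum to zero, and make_u_dict and apply_batch are hypothetical names.

```python
import itertools
import numpy as np

def make_u_dict(n_batches, n_glycans, seed):
    """Hypothetical helper: one fixed unit direction vector per batch."""
    rng = np.random.default_rng(seed)
    u_dict = {}
    for b in range(n_batches):
        u = rng.normal(size=n_glycans)
        u -= u.mean()                      # centered, like CLR coordinates
        u_dict[b] = u / np.linalg.norm(u)
    return u_dict

def apply_batch(Y_clean, batch_labels, u_dict, kappa_mu, var_b, rng):
    """y_batch = y_clean + kappa_mu * sigma * u_b + epsilon (sketch)."""
    sigma = Y_clean.std(axis=0)            # assumption: per-glycan spread
    Y = Y_clean.copy()
    for b, u_b in u_dict.items():
        rows = batch_labels == b
        eps = rng.normal(0.0, np.sqrt(var_b) * sigma,
                         size=(int(rows.sum()), sigma.size))
        Y[rows] += kappa_mu * sigma * u_b + eps
    return Y

# Toy clean data in CLR space: 12 samples x 8 glycans, 3 batches.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.full(8, 10.0), size=12)
Y_clean = np.log(P) - np.log(P).mean(axis=1, keepdims=True)
batch_labels = np.arange(12) % 3

# Direction vectors are defined once and reused for every grid point.
u_dict = make_u_dict(n_batches=3, n_glycans=8, seed=123)

datasets = {}
for kappa_mu, var_b in itertools.product([0.0, 0.5, 1.0], [0.0, 0.5]):
    run_rng = np.random.default_rng(7)     # same noise seed per grid point
    datasets[(kappa_mu, var_b)] = apply_batch(
        Y_clean, batch_labels, u_dict, kappa_mu, var_b, run_rng)
```

Holding u_dict fixed means that differences between grid points come only from kappa_mu and var_b, not from a change in batch geometry.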

Use Cases

The use_cases/batch_correction/ directory demonstrates:

Running a GlycoForge simulation and then applying a batch-correction workflow
Visualizing batch-correction effectiveness metrics

Limitation

Two biological groups only: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease designs (>=3 groups) would require refactoring the Dirichlet parameter generation and the evaluation metrics.
