GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.
- Two simulation modes: fully synthetic, or hybrid (extract biological effects from input reference data + simulate batch effects)
- Controllable effect injection: systematic grid search over biological effect and batch effect strength parameters
- MNAR missing data simulation: Mimics left-censored patterns biased toward low-abundance glycans
- Python >= 3.10 required.
- Core dependency:
glycowork>=1.6.4
pip install glycoforge
OR
git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
See run_simulation.ipynb for interactive examples, or use_cases/batch_correction/ for batch correction workflows.
We keep everything in the CLR (centered log-ratio) space:
- First, draw a healthy baseline composition from a Dirichlet prior: `p_H ~ Dirichlet(alpha_H)`.
- Flip to CLR: `z_H = clr(p_H)`.
- For selected glycans, push the signal using real or synthetic effect sizes: `z_U = z_H + m * lambda * d_robust`, where `m` is the differential mask, `lambda` is `bio_strength`, and `d_robust` is the effect vector after `robust_effect_size_processing`.
  - Simplified mode: draw synthetic effect sizes (log-fold changes) and pass them through the same robust processing pipeline.
  - Hybrid mode: start from the Cohen's d values returned by `glycowork.get_differential_expression`; `define_differential_mask` lets you restrict the injection to significant hits or top-N glycans before scaling.
- Invert back to proportions: `p_U = invclr(z_U)`, then scale by `k_dir` to get `alpha_U`. Note that the healthy and unhealthy Dirichlet strengths use different `k_dir` values, and a separate `variance_ratio` controls their relative magnitude.
- Batch effects ride on top as direction vectors `u_b`, so a clean CLR sample `Y_clean` becomes `Y_with_batch = Y_clean + kappa_mu * u_b + epsilon`, with `var_b` controlling spread.
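The CLR round trip and masked injection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not GlycoForge's own code: the helper names `clr` and `invclr` mirror the notation in the list, while the mask, effect vector, `lam`, and `k_dir` values are made up for the example.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform of a strictly positive composition."""
    log_p = np.log(p)
    return log_p - log_p.mean(axis=-1, keepdims=True)

def invclr(z):
    """Inverse CLR: exponentiate and renormalize so the parts sum to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift by max for stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_glycans = 6

alpha_H = np.full(n_glycans, 10.0)   # illustrative healthy concentration
p_H = rng.dirichlet(alpha_H)         # healthy baseline composition
z_H = clr(p_H)                       # move to CLR space

# Illustrative values only: mask, effect vector, and strength are assumptions.
m = np.array([1, 1, 0, 0, 0, 1], dtype=float)
d_robust = np.array([0.8, -0.5, 0.3, 0.0, -0.2, 1.1])
lam = 1.5                            # plays the role of bio_strength

z_U = z_H + m * lam * d_robust       # inject signal into masked glycans only
p_U = invclr(z_U)                    # back to proportions

k_dir = 100.0                        # illustrative concentration scale
alpha_U = k_dir * p_U                # Dirichlet parameters for the unhealthy group
```

After a masked injection `z_U` no longer sums to zero, but the renormalization in `invclr` absorbs any constant shift, so `p_U` is still a valid composition and the ratios between unmasked glycans are unchanged.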
The pipeline entry point is glycoforge.simulate() with two modes controlled by data_source. Configuration files are in sample_config/.
Simplified mode (data_source="simulated") – Fully synthetic simulation
No real data dependency. Ideal for controlled experiments with known ground truth.
Pipeline steps:
- Initializes uniform healthy baseline: `alpha_H = ones(n_glycans) * 10`
- For each random seed, generates `alpha_U` by randomly scaling `alpha_H`:
  - `up_frac` (default 30%) upregulated with scale factors from `up_scale_range=(1.1, 3.0)`
  - `down_frac` (default 30%) downregulated with scale factors from `down_scale_range=(0.3, 0.9)`
  - Remaining glycans (~40%) stay unchanged
- Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples
- Defines batch effect direction vectors `u_dict` once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep)
- Applies batch effects controlled by `kappa_mu` (shift strength) and `var_b` (variance scaling)
- Optionally applies MNAR (Missing Not At Random) missingness:
  - `missing_fraction`: proportion of missing values (0.0-1.0)
  - `mnar_bias`: intensity-dependent bias (default 2.0, range 0.5-5.0)
  - Left-censored pattern: low-abundance glycans more likely to be missing
- Grid search over `kappa_mu` and `var_b` produces multiple datasets under identical batch effect structure
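As a rough illustration of the first three steps, the sketch below builds `alpha_U` from a uniform `alpha_H` and samples both cohorts. The function `make_alpha_U` is a hypothetical helper, not part of the GlycoForge API, and drawing the scale factors uniformly from each range is an assumption; the text above only gives the ranges.

```python
import numpy as np

def make_alpha_U(alpha_H, rng, up_frac=0.3, down_frac=0.3,
                 up_scale_range=(1.1, 3.0), down_scale_range=(0.3, 0.9)):
    """Hypothetical helper: scale random subsets of alpha_H up or down."""
    n = alpha_H.size
    n_up = int(round(up_frac * n))
    n_down = int(round(down_frac * n))
    order = rng.permutation(n)                 # disjoint random subsets
    up_idx = order[:n_up]
    down_idx = order[n_up:n_up + n_down]
    alpha_U = alpha_H.copy()
    # Assumption: scale factors are uniform within each range.
    alpha_U[up_idx] *= rng.uniform(*up_scale_range, size=n_up)
    alpha_U[down_idx] *= rng.uniform(*down_scale_range, size=n_down)
    return alpha_U, up_idx, down_idx

rng = np.random.default_rng(42)
n_glycans, n_H, n_U = 20, 15, 15

alpha_H = np.ones(n_glycans) * 10              # uniform healthy baseline
alpha_U, up_idx, down_idx = make_alpha_U(alpha_H, rng)

# Clean cohorts: one row per sample, each row sums to 1.
healthy = rng.dirichlet(alpha_H, size=n_H)
unhealthy = rng.dirichlet(alpha_U, size=n_U)
```

Because the up- and down-regulated indices are returned alongside `alpha_U`, the ground truth for each seed stays available for later evaluation.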
Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, missing_fraction, mnar_bias
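The exact MNAR mechanism is not spelled out here, so the following is only one plausible way to realize a left-censored pattern with the two parameters named above. The function `apply_mnar_sketch` and its weighting scheme (missingness probability decaying exponentially with standardized log-abundance) are assumptions for illustration, not GlycoForge's implementation.

```python
import numpy as np

def apply_mnar_sketch(X, missing_fraction, mnar_bias, rng):
    """Hypothetical MNAR sketch: mask low-abundance entries preferentially."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_missing = int(round(missing_fraction * X.size))
    if n_missing == 0:
        return out
    log_x = np.log(X.ravel())
    spread = max(log_x.std(), 1e-12)           # guard against constant input
    zscore = (log_x - log_x.mean()) / spread
    # Assumption: lower abundance -> larger weight; mnar_bias sets steepness.
    weights = np.exp(-mnar_bias * zscore)
    probs = weights / weights.sum()
    idx = rng.choice(X.size, size=n_missing, replace=False, p=probs)
    out.flat[idx] = np.nan                     # mark chosen entries as missing
    return out

rng = np.random.default_rng(7)
alpha = np.linspace(1.0, 20.0, 20)             # uneven abundances across glycans
X = rng.dirichlet(alpha, size=30)              # 30 samples x 20 glycans

X_missing = apply_mnar_sketch(X, missing_fraction=0.2, mnar_bias=2.0, rng=rng)
missing = np.isnan(X_missing)
```

With `mnar_bias=0` every entry gets the same weight and the scheme reduces to missing completely at random; larger values concentrate the missing entries among the smallest abundances.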
Hybrid mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect
Starts from real glycomics data to preserve biological signal structure. Accepts CSV file or glycowork.glycan_data datasets.
Pipeline steps:
- Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via `column_prefix`)
- Runs CLR-based differential expression via `glycowork.get_differential_expression` to compute Cohen's d effect sizes
- Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
- Applies `differential_mask` to select which glycans receive biological signal injection:
  - `"All"`: inject into all glycans
  - `"significant"`: only glycans marked significant by glycowork
  - `"Top-N"`: top N glycans by absolute effect size (e.g., `"Top-10"`)
- Processes effect sizes through `robust_effect_size_processing`:
  - Centers effect sizes to remove global shift
  - Applies Winsorization to clip extreme outliers (auto-selects percentile 85-99, or uses `winsorize_percentile`)
  - Normalizes by baseline (`baseline_method`: median, MAD, or p75)
  - Returns normalized `d_robust` scaled by `bio_strength`
- Injects effects in CLR space: `z_U = z_H + mask * bio_strength * d_robust`
- Converts back to proportions: `p_U = invclr(z_U)`
- Scales by Dirichlet concentration: `alpha_H = k_dir * p_H` and `alpha_U = (k_dir / variance_ratio) * p_U`
- Samples clean cohorts from `Dirichlet(alpha_H)` and `Dirichlet(alpha_U)` with `n_H` healthy and `n_U` unhealthy samples
- Defines batch effect direction vectors `u_dict` once per run (fixed seed ensures fair comparison across parameter combinations)
- Applies batch effects: `y_batch = y_clean + kappa_mu * sigma * u_b + epsilon`, where `epsilon ~ N(0, sqrt(var_b) * sigma)`
- Optionally applies MNAR missingness (same as Simplified mode: left-censored pattern biased toward low-abundance glycans)
- Grid search over `bio_strength`, `k_dir`, `variance_ratio`, `kappa_mu`, `var_b` to systematically test biological signal and batch effect interactions
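The batch-effect step can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `sigma` is taken to be the per-glycan standard deviation of the clean CLR data, the second argument of `N(0, ...)` is read as a standard deviation, the direction vectors are centered and unit-norm, and both function names are hypothetical.

```python
import numpy as np

def make_batch_directions(n_batches, n_glycans, seed):
    """Hypothetical helper: fixed-seed, zero-sum, unit-norm direction per batch."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_batches, n_glycans))
    u -= u.mean(axis=1, keepdims=True)          # stay in the CLR (zero-sum) subspace
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return {b: u[b] for b in range(n_batches)}

def apply_batch_effect(Y_clean, batch_labels, u_dict, kappa_mu, var_b, rng):
    """y_batch = y_clean + kappa_mu * sigma * u_b + epsilon, applied per batch."""
    sigma = Y_clean.std(axis=0, ddof=1)         # assumption: per-glycan SD
    Y = Y_clean.copy()
    for b, u_b in u_dict.items():
        rows = batch_labels == b
        eps = rng.normal(0.0, np.sqrt(var_b) * sigma,
                         size=(int(rows.sum()), sigma.size))
        Y[rows] += kappa_mu * sigma * u_b + eps
    return Y

rng = np.random.default_rng(1)
n_samples, n_glycans, n_batches = 12, 8, 3

# Clean CLR data from Dirichlet samples.
P = rng.dirichlet(np.full(n_glycans, 10.0), size=n_samples)
Y_clean = np.log(P) - np.log(P).mean(axis=1, keepdims=True)
batch_labels = np.repeat(np.arange(n_batches), n_samples // n_batches)

# Directions are defined once, then reused across the whole grid.
u_dict = make_batch_directions(n_batches, n_glycans, seed=0)
datasets = {
    (kappa_mu, var_b): apply_batch_effect(Y_clean, batch_labels, u_dict,
                                          kappa_mu, var_b, rng)
    for kappa_mu in (0.5, 1.0)
    for var_b in (0.1, 0.5)
}
```

Building `u_dict` outside the grid loop is what keeps the batch geometry identical across parameter combinations, so differences between datasets come from `kappa_mu` and `var_b` alone.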
Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias
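To make the effect-size handling more concrete, here is a small sketch of centering, Winsorization, baseline normalization, and a Top-N mask. Both functions are hypothetical stand-ins: the real `robust_effect_size_processing` may center, clip, and normalize differently, the automatic percentile selection is not reproduced, and the input vector is invented for the example.

```python
import numpy as np

def robust_effects_sketch(d, winsorize_percentile=95, baseline_method="median"):
    """Hypothetical sketch: center, winsorize, then normalize effect sizes."""
    d = np.asarray(d, dtype=float)
    centered = d - d.mean()                        # remove global shift
    # Symmetric clipping at the chosen percentile of absolute values.
    limit = np.percentile(np.abs(centered), winsorize_percentile)
    clipped = np.clip(centered, -limit, limit)
    if baseline_method == "median":
        baseline = np.median(np.abs(clipped))
    elif baseline_method == "MAD":
        baseline = np.median(np.abs(clipped - np.median(clipped)))
    elif baseline_method == "p75":
        baseline = np.percentile(np.abs(clipped), 75)
    else:
        raise ValueError(f"unknown baseline_method: {baseline_method}")
    return clipped / baseline if baseline > 0 else clipped

def top_n_mask(d, n):
    """1.0 for the n glycans with the largest absolute effect size, else 0.0."""
    mask = np.zeros(len(d))
    mask[np.argsort(-np.abs(np.asarray(d)))[:n]] = 1.0
    return mask

# Invented Cohen's d values; index 2 is an extreme outlier.
d = np.array([0.2, -0.4, 5.0, 0.1, -0.3, 0.6, -0.1, 0.05])

mask = top_n_mask(d, 3)                            # analogous to "Top-3"
d_robust = robust_effects_sketch(d)

bio_strength = 1.0                                 # illustrative value
delta = mask * bio_strength * d_robust             # added to z_H in CLR space
```

In this sketch the scaling by `bio_strength` is left to the injection step, so `d_robust` itself stays unitless.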
The use_cases/batch_correction/ directory demonstrates:
- Calling the `glycoforge` simulation and then applying a correction workflow
- Visualizing batch correction effectiveness metrics
Two biological groups only: The current implementation targets a healthy/unhealthy setup. Supporting multi-stage disease (>=3 groups) would require refactoring the Dirichlet parameter generation and the evaluation metrics.
