xCell pipeline for tumor microenvironment (TME) cell-type enrichment profiling.
This repository provides a reproducible Python interface to the xCell method (Aran et al., Genome Biology 2017), which estimates enrichment of 64 immune and stromal cell types from bulk gene expression data.
The script xcell_tme_deconvolution_runner.py is a thin but robust wrapper around an
R function (run_xcell_pipeline()) implemented via rpy2. It:
- Runs the full xCell enrichment and spillover compensation pipeline.
- Supports multiple expression file formats (.csv, .tsv, .txt, .xlsx).
- Automatically detects whether genes are in rows or columns and re-orients as needed.
- Handles Ensembl-to-HGNC symbol mapping and collapses duplicated gene symbols.
- Applies data-type-aware normalization (edgeR TMM, log transforms, or pass-through).
- Produces xCell scores, composite TME indices, heatmaps, PCA/UMAP plots, and stacked compositions.
- Writes biological context and automated interpretations for each cell type.
Typical layout for this project:
- xcell_tme_deconvolution_runner.py – main Python CLI wrapper.
- requirements.txt – Python dependencies (including rpy2).
- Xcell_Paper_ref.pdf – reference PDF of the original xCell paper.
- Teset_Run/ – example folder with test data and outputs.
- .venv/ – optional local virtual environment (recommended).
You can extend this with more structured folders (e.g. xcell_enrichment/, xcell_tests/,
xcell_docs/) as the project grows.
- Python 3.12 (tested)
- Recommended OS: Windows, Linux, or macOS with a working R installation.
- Key Python package: rpy2 (installed via requirements.txt).
- R >= 4.1 (recommended)
- The script can optionally auto-install the following R dependencies when run with --do-install:
CRAN:
data.table, readr, stringr, edgeR, pheatmap, ggplot2, RColorBrewer, viridis,
dplyr, tidyr, tibble, knitr, kableExtra, rmarkdown, uwot, readxl,
cowplot, gridExtra
Bioconductor:
xCell, AnnotationDbi, org.Hs.eg.db
If you prefer to manage R packages yourself (e.g. on HPC), you can install these manually
and run the pipeline with --no-install.
Create and use a dedicated virtual environment for this project.
Windows (PowerShell):
cd <path-to-your-repo>
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install -r requirements.txt
Linux/macOS:
cd <path-to-your-repo>
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
The expression table can be:
- Genes in rows, samples in columns, first column = gene ID.
- Or samples in rows, genes in columns, first column = sample ID.
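One plausible way to tell the two layouts apart is to check whether the known sample IDs show up in the header or in the first column. The sketch below is only an illustration of that idea: the function names (`detect_orientation`, `transpose`) and the overlap heuristic are assumptions, not the logic used inside run_xcell_pipeline().

```python
def detect_orientation(header, first_column, known_sample_ids):
    """Guess the table layout by counting matches against metadata sample IDs.

    header: column names after the first (ID) column.
    first_column: row IDs, i.e. the values of the first column.
    known_sample_ids: sample IDs taken from the metadata table.
    Heuristic (assumption): whichever axis matches more sample IDs holds the samples.
    """
    known = {s.strip().lower() for s in known_sample_ids}
    col_hits = sum(1 for h in header if h.strip().lower() in known)
    row_hits = sum(1 for r in first_column if r.strip().lower() in known)
    return "samples_in_rows" if row_hits > col_hits else "genes_in_rows"


def transpose(table):
    """Transpose a rectangular list-of-lists table (header row included)."""
    return [list(col) for col in zip(*table)]


# Example: a small samples-in-rows table that gets re-oriented to genes-in-rows.
table = [
    ["sample_id", "TP53", "CD3E"],
    ["s1", "5.0", "1.2"],
    ["s2", "4.1", "0.3"],
]
layout = detect_orientation(table[0][1:], [row[0] for row in table[1:]], ["S1", "S2"])
if layout == "samples_in_rows":
    table = transpose(table)
```

A tie (including the case where nothing matches) falls back to genes in rows here, which is a choice made for the example only.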
The pipeline will:
- Detect the orientation and transpose if necessary.
- Convert numeric values appropriately and handle non-numeric entries gracefully.
- Optionally map Ensembl IDs (e.g. ENSG...) to HGNC symbols via org.Hs.eg.db.
- Collapse duplicated gene symbols by summation.
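To make the last step concrete, here is a minimal pure-Python sketch of collapsing duplicated gene symbols by summation. The function name `collapse_duplicate_genes` and the (symbol, values) row representation are illustrative assumptions; the pipeline itself performs this step in R.

```python
def collapse_duplicate_genes(rows):
    """Sum expression values of rows that share the same gene symbol.

    rows: iterable of (gene_symbol, [value per sample]) pairs.
    Returns a dict keyed by symbol, in order of first appearance.
    """
    sums = {}
    for symbol, values in rows:
        if symbol in sums:
            # Element-wise sum across samples for a repeated symbol.
            sums[symbol] = [a + b for a, b in zip(sums[symbol], values)]
        else:
            sums[symbol] = list(values)
    return sums


# Two Ensembl IDs that mapped to the same symbol "A" are merged into one row.
collapsed = collapse_duplicate_genes([
    ("A", [1.0, 2.0]),
    ("B", [3.0, 4.0]),
    ("A", [0.5, 0.5]),
])
```

Summation suits count-like data, where reads from several IDs of the same gene add up; for already log-transformed values a different aggregate might be preferable.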
The metadata table must contain at least the following columns (header names are case-insensitive; the script normalizes them):
- sample_id – raw sample identifier matching the expression matrix columns/rows.
- condition – grouping variable (e.g. AKI, Control, etc.).
Internally, the pipeline standardizes sample IDs (lowercase, strips extensions, cleans symbols) and aligns metadata with expression samples. Samples present in metadata but not in the expression matrix are dropped with a note in the logs.
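The standardization and alignment described above can be sketched roughly as follows. The exact rules the pipeline applies are not documented here, so the extension list, the replacement of symbols with underscores, and both function names are assumptions made for illustration.

```python
import re

# Assumption: a few extensions that commonly end up in sample names.
_EXTENSIONS = (".fastq.gz", ".fq.gz", ".bam", ".csv", ".tsv", ".txt")


def standardize_sample_id(raw):
    """Lowercase, strip a known file extension, and replace symbols with '_'."""
    s = raw.strip().lower()
    for ext in _EXTENSIONS:
        if s.endswith(ext):
            s = s[: -len(ext)]
            break
    s = re.sub(r"[^a-z0-9]+", "_", s)
    return s.strip("_")


def align_metadata(metadata_ids, expression_ids):
    """Split metadata IDs into those found in the expression data and those not."""
    expr = {standardize_sample_id(e) for e in expression_ids}
    kept, dropped = [], []
    for m in metadata_ids:
        (kept if standardize_sample_id(m) in expr else dropped).append(m)
    return kept, dropped


kept, dropped = align_metadata(["S1", "S2", "S3"], ["s1.csv", "s2"])
```

Running the same kind of normalization on your own IDs before a run is a quick way to predict which metadata rows would be dropped.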
Once your virtual environment is activated and dependencies are installed, you can run:
python xcell_tme_deconvolution_runner.py \
--data-dir "D:\AyassBio_Workspace_Downloads\xcell-deonv\xcell_py_pkg\Teset_Run" \
--expr-file "GSE139061_Eadon_processed_QN_101419.csv" \
--meta-file "sample_metadata_patientAKI.tsv" \
--out-dir "xcell_output_V31_BIO" \
--no-install
Notes:
- --data-dir is used as a base path to resolve relative --expr-file, --meta-file, and --out-dir.
- You can provide absolute paths for any of these arguments; they will be respected.
- --no-install skips automatic R package installation (recommended if your R environment is already set up).
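The path-resolution rule in these notes can be illustrated with a short pathlib sketch. The helper name `resolve_against` is hypothetical and the code is not taken from the script; it only shows the behavior described: relative values are joined onto the base directory, absolute values pass through unchanged.

```python
from pathlib import Path


def resolve_against(data_dir, value):
    """Resolve `value` relative to `data_dir` unless it is already absolute.

    data_dir may be None, in which case `value` is returned as given.
    """
    p = Path(value)
    if data_dir is None or p.is_absolute():
        return p
    return Path(data_dir) / p


# A relative file name is looked up under the base directory.
expr_path = resolve_against("Teset_Run", "expression.csv")
# An absolute path is returned untouched, whatever the base directory is.
abs_path = resolve_against("Teset_Run", str(Path("elsewhere").resolve()))
```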
- --data-dir (optional): base directory for resolving relative paths.
- --expr-file (required): expression matrix (csv/tsv/txt/xlsx).
- --meta-file (required): metadata table with sample_id and condition.
- --out-dir (required): output directory for all results.
- --do-install / --no-install: whether to install/ensure R dependencies from within the pipeline.
- --thresh-med-high: median threshold for calling enrichment “High” (default 0.20).
- --thresh-med-mod: median threshold for calling enrichment “Moderate” (default 0.10).
- --pres-fdr-alpha: FDR cut-off for presence significance (default 0.10).
- --pres-min-frac: minimum fraction of significant samples to call “Frequent” (default 0.35).
- --umap-min-samples: minimum number of samples required to compute UMAP (default 20).
- --max-cards: maximum number of “sample cards” (not currently used in plotting; accepted for future extension).
- --top-n-card: top-N cell types per sample card (currently used for limiting boxplot subsets).
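To show how the four threshold arguments might interact, the sketch below applies the documented defaults to a single cell type. It is a hedged illustration, not the pipeline's code: the function names, the "Low" label for everything below the moderate threshold, and the choice of inclusive versus strict comparisons are all assumptions.

```python
def call_enrichment(median_score, thresh_high=0.20, thresh_mod=0.10):
    """Label a cell type by its cohort median xCell score."""
    if median_score >= thresh_high:
        return "High"
    if median_score >= thresh_mod:
        return "Moderate"
    return "Low"  # assumption: the label below --thresh-med-mod is not documented


def call_presence(q_values, fdr_alpha=0.10, min_frac=0.35):
    """Return (fraction of significant samples, whether that counts as 'Frequent').

    q_values: per-sample FDR-adjusted p-values for one cell type.
    """
    if not q_values:
        return 0.0, False
    frac = sum(1 for q in q_values if q < fdr_alpha) / len(q_values)
    return frac, frac >= min_frac


label = call_enrichment(0.15)
frac, frequent = call_presence([0.01, 0.50, 0.02, 0.90])
```

Raising --pres-min-frac makes the “Frequent” call stricter without changing which individual samples count as significant; that is controlled by --pres-fdr-alpha alone.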
All outputs are written under --out-dir. The pipeline creates a structured set of subfolders:
- xCell_Scores/
  - xcell_raw_scores.csv – raw enrichment scores from xCell.
  - xcell_transformed_scores.csv – scores after curve fitting / transformation.
  - xcell_final_scores.csv – spillover-compensated scores (primary matrix to use).
  - Cohort_TME_Matrix.csv – per-sample summary table containing:
    - sample_std – standardized sample ID.
    - condition.
    - ImmuneScore_xCell.
    - StromaScore_xCell.
    - MicroenvironmentScore_xCell.
    - TME_Inflammation_Index – “hot vs. cold” TME proxy.
  - xcell_cohort_celltype_enrichment_summary.csv – per-cell-type median, presence fraction, and narrative interpretation.
  - High_Variability_CellTypes.csv – cell types with many significant samples by FDR.
  - normalized_expression_<method>.csv – normalized expression matrix (edgeR TMM logCPM, log2, or as-is).
- Cell-Type Enrichment Profile (Per Sample)/
  - TME_Enrichment_Heatmap_AllCellTypes.png – heatmap of all cell types across samples.
  - TopVariability_CellTypes_Heatmap.png – heatmap of the most variable cell types.
  - xCell_Score_PCA_by_Condition.png – PCA of xCell scores colored by condition.
  - CellType_Correlation_Spearman.png – cell-type correlation matrix.
  - UMAP_xCell_EnrichmentSpace.png – UMAP of xCell enrichment space (if enough samples).
  - Composite_Indices_by_Condition.png – ImmuneScore, StromaScore, MicroenvironmentScore, and TME inflammation index by condition.
  - Stacked_Composition_AllCellTypes.png – stacked composition plot of enrichment per sample.
  - Boxplots_TopK/ – boxplots of top-variability cell types stratified by condition.
  - Xcell_Analytical_Figures/ – reserved for additional figures/cards (future extension).
- ImmuneProfile_Context/
  - xcell_bio_context.csv – high-level biological descriptions for each xCell cell type and composite score.
- sessionInfo.txt – full R session information for reproducibility (R version, packages, etc.).
- All key steps are logged via intermediate CSV files and plots.
- R session info is saved to sessionInfo.txt to capture exact package versions.
- Orientation detection for expression matrices is automatic, and debug CSVs are written if alignment fails.
- All file paths used by R are passed explicitly from the Python CLI; no hidden defaults are used for I/O.
- Ensure your virtual environment is activated.
- Re-run pip install -r requirements.txt.
- If automatic installation (--do-install) fails, install the listed CRAN/Bioconductor packages manually in R, then run with --no-install.
- Check that your R installation is on the system PATH and compatible with your Python/rpy2 build.
- Verify that sample_id in the metadata matches the expression sample IDs.
- Remember that the script standardizes IDs (lowercases, strips extensions, and cleans symbols).
- Inspect the generated debug files _debug_plotmat_cols.csv and _debug_requested_samples.csv in the output directory.
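For the R-related checks in the list above, a small pre-flight script can confirm that R is discoverable before rpy2 is imported. This is a sketch under assumptions: the helper name `find_r_executable` is hypothetical and it is not part of the wrapper. It relies only on the standard library (`shutil.which`) and on the R_HOME environment variable, which rpy2 also consults when locating R.

```python
import os
import shutil


def find_r_executable(candidates=("Rscript", "R")):
    """Return the full path of the first R executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


r_path = find_r_executable()
r_home = os.environ.get("R_HOME")  # optional override used to locate R
if r_path is None and not r_home:
    print("R was not found on PATH and R_HOME is not set; rpy2 will likely fail to load.")
else:
    print(f"R executable: {r_path!r}, R_HOME: {r_home!r}")
```

If the check fails inside the activated virtual environment but R works from a normal terminal, the PATH seen by the environment is the first thing to compare.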
If you use this wrapper in your work, please cite both the original xCell paper and this repository:
- xCell method: Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology. 2017;18:220. doi:10.1186/s13059-017-1349-1.
- Python wrapper: Please reference this GitHub repository (URL and commit) in the Methods section as the implementation used for running xCell.