xCell framework, enabling expert-level tumor microenvironment cell-type enrichment and TME index profiling from bulk transcriptomic data.

shari01/Cell-Type-Scoring-and-Enrichment-Using-xCell

xCell TME Enrichment

xCell pipeline for tumor microenvironment (TME) cell-type enrichment profiling.


Overview

This repository provides a reproducible Python interface to the xCell method (Aran et al., Genome Biology 2017), which estimates enrichment of 64 immune and stromal cell types from bulk gene expression data.

The script xcell_tme_deconvolution_runner.py is a thin wrapper that calls an R function, run_xcell_pipeline(), through rpy2. It:

  • Runs the full xCell enrichment and spillover compensation pipeline.
  • Supports multiple expression file formats (.csv, .tsv, .txt, .xlsx).
  • Automatically detects whether genes are in rows or columns and re-orients as needed.
  • Maps Ensembl IDs to HGNC symbols and collapses duplicate symbols.
  • Applies data-type-aware normalization (edgeR TMM, log transforms, or pass-through).
  • Produces xCell scores, composite TME indices, heatmaps, PCA/UMAP plots, and stacked compositions.
  • Writes biological context and automated interpretations for each cell type.

Repository Layout

Typical layout for this project:

  • xcell_tme_deconvolution_runner.py – main Python CLI wrapper.
  • requirements.txt – Python dependencies (including rpy2).
  • Xcell_Paper_ref.pdf – reference PDF of the original xCell paper.
  • Teset_Run/ – example folder with test data and outputs.
  • .venv/ – optional local virtual environment (recommended).

You can extend this with more structured folders (e.g. xcell_enrichment/, xcell_tests/, xcell_docs/) as the project grows.


Requirements

Python

  • Python 3.12 (tested)
  • Recommended OS: Windows, Linux, or macOS with a working R installation.
  • Key Python package: rpy2 (installed via requirements.txt).

R

  • R >= 4.1 (recommended)
  • The script can optionally auto-install R dependencies when run with --do-install:
CRAN:
  data.table, readr, stringr, pheatmap, ggplot2, RColorBrewer, viridis,
  dplyr, tidyr, tibble, knitr, kableExtra, rmarkdown, uwot, readxl,
  cowplot, gridExtra

Bioconductor:
  edgeR, AnnotationDbi, org.Hs.eg.db

GitHub (dviraran/xCell):
  xCell

If you prefer to manage R packages yourself (e.g. on HPC), you can install these manually and run the pipeline with --no-install.


Python Virtual Environment (recommended)

Create and use a dedicated virtual environment for this project.

Windows (PowerShell)

cd <path-to-your-repo>

python -m venv .venv
.\.venv\Scripts\Activate.ps1

pip install --upgrade pip
pip install -r requirements.txt

Linux / macOS

cd <path-to-your-repo>

python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

Input Files

1. Expression matrix (--expr-file)

The expression table can be:

  • Genes in rows, samples in columns, first column = gene ID.
  • Or samples in rows, genes in columns, first column = sample ID.

The pipeline will:

  • Detect the orientation and transpose if necessary.
  • Convert numeric values appropriately and handle non-numeric entries gracefully.
  • Optionally map Ensembl IDs (e.g. ENSG...) to HGNC symbols via org.Hs.eg.db.
  • Collapse duplicated gene symbols by summation.
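
The orientation and duplicate-collapsing steps above can be illustrated in plain Python. This is only a sketch of the idea, not the wrapper's implementation (which runs in R): the heuristic of comparing how many row labels versus column labels look like gene IDs, the function names, and the known_genes reference set are all assumptions made for illustration.

```python
import re

# Assumption: Ensembl human gene IDs start with "ENSG" followed by digits.
ENSEMBL_RE = re.compile(r"^ENSG\d+", re.IGNORECASE)


def gene_like_fraction(labels, known_genes):
    """Fraction of labels that look like gene IDs (Ensembl ID or known symbol)."""
    if not labels:
        return 0.0
    hits = sum(
        1 for label in labels
        if ENSEMBL_RE.match(label) or label.upper() in known_genes
    )
    return hits / len(labels)


def orient_genes_in_rows(row_labels, col_labels, matrix, known_genes):
    """Return (gene_labels, sample_labels, matrix) with genes in rows.

    The matrix is transposed when the column labels look more gene-like
    than the row labels (hypothetical heuristic).
    """
    row_frac = gene_like_fraction(row_labels, known_genes)
    col_frac = gene_like_fraction(col_labels, known_genes)
    if col_frac > row_frac:
        transposed = [list(col) for col in zip(*matrix)]
        return col_labels, row_labels, transposed
    return row_labels, col_labels, matrix


def collapse_duplicates_by_sum(gene_labels, matrix):
    """Sum rows that share a gene symbol, keeping first-seen order."""
    sums = {}
    for gene, row in zip(gene_labels, matrix):
        if gene in sums:
            sums[gene] = [a + b for a, b in zip(sums[gene], row)]
        else:
            sums[gene] = list(row)
    return list(sums), list(sums.values())


# Toy input with samples in rows and a duplicated gene column.
known = {"CD3E", "CD8A"}
genes, samples, mat = orient_genes_in_rows(
    ["s1", "s2"], ["CD3E", "CD8A", "CD3E"],
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], known,
)
genes, mat = collapse_duplicates_by_sum(genes, mat)
```

In this toy case the matrix is transposed so that genes end up in rows, and the two CD3E rows are summed into one.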

2. Metadata table (--meta-file)

The metadata table must contain at least the following columns (header names are case-insensitive; the script normalizes them):

  • sample_id – raw sample identifier matching the expression matrix columns/rows.
  • condition – grouping variable (e.g. AKI, Control).

Internally, the pipeline standardizes sample IDs (lowercase, strips extensions, cleans symbols) and aligns metadata with expression samples. Samples present in metadata but not in the expression matrix are dropped with a note in the logs.
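
To check in advance how your IDs are likely to align, the standardization described above can be approximated in Python. The exact rules used by the pipeline are not documented here, so the extension list, the replacement of symbols with underscores, and both function names below are assumptions; treat this as a rough pre-flight check rather than a replica of the R logic.

```python
import re

# Assumption: these are the kinds of file extensions that get stripped.
_EXT_RE = re.compile(r"\.(csv|tsv|txt|xlsx|bam|fastq|fq|gz)$", re.IGNORECASE)


def standardize_sample_id(raw):
    """Lowercase, strip trailing file extensions, and clean symbols."""
    s = raw.strip().lower()
    # Strip extensions repeatedly so "x.fastq.gz" becomes "x".
    while True:
        stripped = _EXT_RE.sub("", s)
        if stripped == s:
            break
        s = stripped
    # Assumption: runs of non-alphanumeric characters become one underscore.
    s = re.sub(r"[^a-z0-9]+", "_", s)
    return s.strip("_")


def align_metadata(meta_ids, expr_ids):
    """Pair metadata IDs with expression IDs via their standardized form.

    Returns (matched, dropped): matched is a list of (meta_id, expr_id)
    pairs; dropped lists metadata IDs with no expression counterpart.
    """
    expr_by_std = {standardize_sample_id(x): x for x in expr_ids}
    matched, dropped = [], []
    for raw in meta_ids:
        key = standardize_sample_id(raw)
        if key in expr_by_std:
            matched.append((raw, expr_by_std[key]))
        else:
            dropped.append(raw)
    return matched, dropped


matched, dropped = align_metadata(["AKI-1", "Ctrl_9"], ["aki_1.txt", "ctrl_2"])
```

Here "AKI-1" pairs with "aki_1.txt" because both standardize to the same key, while "Ctrl_9" has no counterpart and would be dropped.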


Running the Pipeline

Basic command

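Once your virtual environment is activated and dependencies are installed, you can run the command below. It uses bash-style line continuation (a trailing \); in PowerShell, replace each trailing \ with a backtick (`) or put the whole command on one line.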

python xcell_tme_deconvolution_runner.py \
  --data-dir "D:\AyassBio_Workspace_Downloads\xcell-deonv\xcell_py_pkg\Teset_Run" \
  --expr-file "GSE139061_Eadon_processed_QN_101419.csv" \
  --meta-file "sample_metadata_patientAKI.tsv" \
  --out-dir "xcell_output_V31_BIO" \
  --no-install

Notes:

  • --data-dir is used as a base path to resolve relative --expr-file, --meta-file, and --out-dir.
  • You can provide absolute paths for any of these arguments; they will be respected.
  • --no-install skips automatic R package installation (recommended if your R environment is already set up).
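
The path rules in these notes amount to the following sketch. The function name resolve_path is hypothetical and the wrapper may implement this differently; the behaviour shown (absolute paths win, relative paths are joined onto --data-dir) is the one described above.

```python
from pathlib import Path


def resolve_path(value, data_dir=None):
    """Resolve a CLI path argument against an optional base directory.

    Absolute paths are returned unchanged; relative paths are joined
    onto data_dir when one is given.
    """
    p = Path(value).expanduser()
    if p.is_absolute() or data_dir is None:
        return p
    return Path(data_dir).expanduser() / p


# A relative file name is resolved under the base directory.
expr_path = resolve_path("expression.csv", "my_data")

# An absolute path is respected as-is, whatever the base directory.
abs_meta = Path.cwd() / "metadata.tsv"
meta_path = resolve_path(str(abs_meta), "my_data")
```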

Key CLI arguments

  • --data-dir (optional): base directory for resolving relative paths.
  • --expr-file (required): expression matrix (csv/tsv/txt/xlsx).
  • --meta-file (required): metadata table with sample_id and condition.
  • --out-dir (required): output directory for all results.
  • --do-install / --no-install: install missing R dependencies from within the pipeline (--do-install) or skip installation entirely (--no-install).
  • --thresh-med-high: median threshold for calling enrichment “High” (default 0.20).
  • --thresh-med-mod: median threshold for calling enrichment “Moderate” (default 0.10).
  • --pres-fdr-alpha: FDR cut-off for presence significance (default 0.10).
  • --pres-min-frac: minimum fraction of significant samples to call “Frequent” (default 0.35).
  • --umap-min-samples: minimum number of samples required to compute UMAP (default 20).
  • --max-cards: maximum number of “sample cards” (not currently used in plotting; accepted for future extension).
  • --top-n-card: top-N cell types per sample card (currently used for limiting boxplot subsets).
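
To make the four threshold arguments concrete, here is one way they could combine into labels for a single cell type. It is a sketch under assumptions: the comparisons are taken to be inclusive, the fallback labels "Low" and "Infrequent" are invented for illustration, and the per-sample FDR values are assumed to be already computed. The pipeline's own rules may differ in these details.

```python
from statistics import median


def classify_cell_type(
    scores,
    fdr_values,
    thresh_med_high=0.20,   # --thresh-med-high
    thresh_med_mod=0.10,    # --thresh-med-mod
    pres_fdr_alpha=0.10,    # --pres-fdr-alpha
    pres_min_frac=0.35,     # --pres-min-frac
):
    """Label one cell type from its per-sample scores and FDR values."""
    med = median(scores)
    if med >= thresh_med_high:
        level = "High"
    elif med >= thresh_med_mod:
        level = "Moderate"
    else:
        level = "Low"  # assumed label, not taken from the pipeline

    # Fraction of samples in which the cell type is significantly present.
    n_significant = sum(1 for q in fdr_values if q <= pres_fdr_alpha)
    presence_frac = n_significant / len(fdr_values)
    presence = "Frequent" if presence_frac >= pres_min_frac else "Infrequent"

    return {
        "median": med,
        "level": level,
        "presence_frac": presence_frac,
        "presence": presence,
    }


result = classify_cell_type(
    scores=[0.25, 0.30, 0.05, 0.22],
    fdr_values=[0.01, 0.50, 0.20, 0.04],
)
```

With the defaults, this example has a median of 0.235 (High) and 2 of 4 samples below the FDR cut-off (Frequent).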

Outputs

All outputs are written under --out-dir. The pipeline creates a structured set of subfolders:

  • xCell_Scores/
    • xcell_raw_scores.csv – raw enrichment scores from xCell.
    • xcell_transformed_scores.csv – scores after curve fitting / transformation.
    • xcell_final_scores.csv – spillover-compensated scores (primary matrix to use).
    • Cohort_TME_Matrix.csv – per-sample summary table containing:
      • sample_std – standardized sample ID.
      • condition.
      • ImmuneScore_xCell.
      • StromaScore_xCell.
      • MicroenvironmentScore_xCell.
      • TME_Inflammation_Index – “hot vs. cold” TME proxy.
    • xcell_cohort_celltype_enrichment_summary.csv – per-cell-type median, presence fraction, and narrative interpretation.
    • High_Variability_CellTypes.csv – cell types with many significant samples by FDR.
    • normalized_expression_<method>.csv – normalized expression matrix (edgeR TMM logCPM, log2, or as-is).
  • Cell-Type Enrichment Profile (Per Sample)/
    • TME_Enrichment_Heatmap_AllCellTypes.png – heatmap of all cell types across samples.
    • TopVariability_CellTypes_Heatmap.png – heatmap of the most variable cell types.
    • xCell_Score_PCA_by_Condition.png – PCA of xCell scores colored by condition.
    • CellType_Correlation_Spearman.png – cell-type correlation matrix.
    • UMAP_xCell_EnrichmentSpace.png – UMAP of xCell enrichment space (if enough samples).
    • Composite_Indices_by_Condition.png – ImmuneScore, StromaScore, MicroenvironmentScore, and TME inflammation index by condition.
    • Stacked_Composition_AllCellTypes.png – stacked composition plot of enrichment per sample.
    • Boxplots_TopK/ – boxplots of top-variability cell types stratified by condition.
    • Xcell_Analytical_Figures/ – reserved for additional figures/cards (future extension).
  • ImmuneProfile_Context/
    • xcell_bio_context.csv – high-level biological descriptions for each xCell cell type and composite score.
  • sessionInfo.txt – full R session information for reproducibility (R version, packages, etc.).
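
For downstream work, xcell_final_scores.csv can be read with the standard library alone. The sketch below assumes the file has cell types in rows and samples in columns, with the cell-type name in the first column; check this against your own output before relying on it. The function name top_cell_types is hypothetical.

```python
import csv


def top_cell_types(lines, n=3):
    """Return the n highest-scoring cell types for each sample.

    `lines` is any iterable of CSV text lines, such as an open file.
    Assumed layout: header row "cell_type,<sample1>,<sample2>,...",
    then one row per cell type.
    """
    reader = csv.reader(lines)
    header = next(reader)
    samples = header[1:]
    rows = [(r[0], [float(v) for v in r[1:]]) for r in reader if r]

    top = {}
    for j, sample in enumerate(samples):
        ranked = sorted(rows, key=lambda item: item[1][j], reverse=True)
        top[sample] = [(name, values[j]) for name, values in ranked[:n]]
    return top


# Toy data standing in for the contents of xcell_final_scores.csv.
example = [
    "cell_type,s1,s2",
    "B-cells,0.1,0.4",
    "Macrophages,0.3,0.2",
    "Fibroblasts,0.2,0.0",
]
ranking = top_cell_types(example, n=2)
```

With a real file, pass an open handle instead of a list, for example by opening xCell_Scores/xcell_final_scores.csv under your --out-dir with open(path, newline="").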

Reproducibility & Design Choices

  • All key steps are logged via intermediate CSV files and plots.
  • R session info is saved to sessionInfo.txt to capture exact package versions.
  • Orientation detection for expression matrices is automatic, and debug CSVs are written if alignment fails.
  • All file paths used by R are passed explicitly from the Python CLI; no hidden defaults are used for I/O.

Troubleshooting

ModuleNotFoundError: No module named 'rpy2'

  • Ensure your virtual environment is activated.
  • Re-run pip install -r requirements.txt.

R package installation issues

  • If automatic installation (--do-install) fails, install the listed CRAN/Bioconductor packages manually in R, then run with --no-install.
  • Check that your R installation is on the system PATH and compatible with your Python/rpy2 build.
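
A quick way to run both checks above from inside the activated virtual environment is sketched below. It only reports what Python can see and does not start R; the function name check_environment is hypothetical, and R_HOME is included on the assumption that your rpy2 setup relies on it when R is not on PATH.

```python
import importlib.util
import os
import shutil


def check_environment():
    """Report whether rpy2 is importable and where R is found."""
    return {
        # True if the active interpreter can locate the rpy2 package.
        "rpy2_importable": importlib.util.find_spec("rpy2") is not None,
        # Full path to the executables, or None if not on PATH.
        "r_on_path": shutil.which("R"),
        "rscript_on_path": shutil.which("Rscript"),
        # Value of R_HOME, or None if the variable is not set.
        "r_home": os.environ.get("R_HOME"),
    }


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

If rpy2_importable is False, the virtual environment is probably not active; if both R entries are None and R_HOME is unset, rpy2 is unlikely to locate R.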

No overlapping sample IDs between metadata and expression

  • Verify that sample_id in the metadata matches the expression sample IDs.
  • Remember that the script standardizes IDs (lowercases, strips extensions, and cleans symbols).
  • Inspect generated debug files _debug_plotmat_cols.csv and _debug_requested_samples.csv in the output directory.

Citation

If you use this wrapper in your work, please cite both the original xCell paper and this repository:

  • xCell method:
    Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology. 2017;18:220. doi:10.1186/s13059-017-1349-1.
  • Python wrapper:
    Please reference this GitHub repository (URL and commit) in the Methods section as the implementation used for running xCell.
