Nextflow pipeline for pan-cancer immunotherapy biomarker discovery using transcriptomic data

bhklab/PredictIO_Nextflow

PredictioR Nextflow Pipeline

Overview

The PredictioR Nextflow (PredictioR-NF) pipeline is a scalable, end-to-end workflow for immunotherapy biomarker discovery across multiple cancer cohorts. It is implemented in Nextflow and runs in Docker for reproducible and portable analyses.

PredictioR-NF accepts input data as Bioconductor SummarizedExperiment (.rda, recommended) or paired expression and clinical CSV files. For each cohort, it performs gene-level and gene-signature association testing, and can optionally aggregate results across cohorts using pan-cancer and cancer-specific meta-analysis.

The main workflow (main.nf) consists of three sequential analysis stages:

  • Gene-level analysis
  • Signature-level analysis
  • Meta-analysis (optional)

Quickstart

Step 1: Install Nextflow and Docker

Before you start (recommended environment)

  • Linux/macOS: supported.
  • Windows: use WSL2 (Ubuntu) + Docker Desktop (recommended).
  • Avoid on Windows: Git Bash / MINGW64 (can cause Nextflow terminal/signal issues).

Requirements: Java ≥ 11 (Java 17 recommended; tested with 17.0.x), Nextflow ≥ 24 (tested with 25.10.2), Docker Engine/Docker Desktop ≥ 20.10 (tested with 27.5.1).

Sanity checks:

java -version
nextflow -version
docker version

Nextflow

Docker

Pull the image:

docker pull bhklab/nextflow-env

Step 2: Project Structure

Before running the pipeline, the project directory should contain:

.
├── main.nf                 # Nextflow workflow (gene + signature association + meta-analysis)
├── nextflow.config         # Profiles/resources + Docker settings
├── ICB_data/               # Cohort inputs: *.rda (SE mode) or *_expr.csv + *_clin.csv (CSV mode)
├── SIG_data/               # Signature .rda files (each loads a `sig` data frame)
├── sig_summary_info/       # Signature metadata (signature_information.csv)
└── output/                 # Results (auto-created): studies/<study_id>/ and meta/
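Before launching a run, it can help to confirm that the project root matches the layout above. The following is a minimal sketch, not part of the pipeline: the helper name `missing_entries` is hypothetical, and `output/` is left out of the check because it is auto-created.

```python
from pathlib import Path

# Entries taken from the layout above; output/ is auto-created, so it is not required.
REQUIRED_FILES = ["main.nf", "nextflow.config"]
REQUIRED_DIRS = ["ICB_data", "SIG_data", "sig_summary_info"]


def missing_entries(project_root):
    """Return the required files/directories that are absent under project_root."""
    root = Path(project_root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    problems = missing_entries(".")
    print("Layout OK" if not problems else "Missing: " + ", ".join(problems))
```

Run it from the project root; an empty result means all expected inputs are in place.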

Step 3: Prepare input data

Each cohort (in either SummarizedExperiment or CSV mode) is expected to represent a single cancer type and a single treatment category.

FAIR data note: PredictioR-NF assumes standardized, well-annotated inputs to enable reproducible analyses and reuse across cohorts. We recommend SummarizedExperiment to keep molecular assays, sample metadata, and feature annotations together, with consistent sample IDs and harmonized clinical endpoint variables.

Curation standards: Clinical variables and genomic metadata were curated and harmonized using mCODE concepts where applicable, and aligned with ICGC/ICGC-ARGO conventions (e.g., consistent variable naming, controlled vocabularies, and cohort metadata structure).

3.1 Gene-level input (ICB_data/)

3.1.1 SummarizedExperiment mode (default; recommended)

  • Input: Bioconductor SummarizedExperiment objects stored as .rda files

  • These objects enable standardized handling of:

    • Gene expression data
    • Clinical annotations
    • Immunotherapy outcome variables

Example input files:

  • ICB_small_Hugo.rda
  • ICB_small_Mariathasan.rda

Example datasets directory: bhklab/PredictioR/tree/main/data

SummarizedExperiment documentation: SummarizedExperiment.html

3.1.2 CSV mode

CSV mode enables analysis of custom cohorts without requiring a SummarizedExperiment.

Expression CSV

  • Genes × samples matrix
  • Rows = gene identifiers
  • Columns = sample IDs

Clinical CSV

  • One row per sample
  • Sample identifiers must align to expression column names

Mandatory sample-matching requirement: expression column names must exactly match the clinical sample identifiers (order does not need to match).
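The sample-matching requirement can be checked before a run with a short script. This is a sketch under stated assumptions: the first column of the expression CSV is taken to hold gene identifiers, and the first column of the clinical CSV is taken to hold sample IDs. Neither layout detail is specified here, so adjust the readers to your files. All function names are hypothetical.

```python
import csv


def read_expression_samples(expr_path):
    """Sample IDs = header cells after the first (gene identifier) column (assumed layout)."""
    with open(expr_path, newline="") as handle:
        header = next(csv.reader(handle))
    return header[1:]


def read_clinical_samples(clin_path):
    """Sample IDs = first column of every data row (assumed layout)."""
    with open(clin_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        return [row[0] for row in reader if row]


def compare_sample_ids(expr_samples, clin_samples):
    """Return IDs present on only one side; order is ignored, matching is exact."""
    expr_set, clin_set = set(expr_samples), set(clin_samples)
    return {
        "only_in_expression": sorted(expr_set - clin_set),
        "only_in_clinical": sorted(clin_set - expr_set),
    }
```

Both lists in the result should be empty. Because matching is exact, differences in case or stray whitespace show up as mismatches on both sides.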

Required clinical columns

| Column | Description |
| --- | --- |
| cancer_type | Cancer type (single unique value per cohort) |
| treatment | Treatment type (single unique value per cohort) |
| response | Response (R / NR) |
| survival_time_os | Overall survival time |
| survival_time_pfs | Progression-free survival time |
| event_occurred_os | OS event indicator (1 = event, 0 = censored) |
| event_occurred_pfs | PFS event indicator (1 = event, 0 = censored) |

Endpoints and definitions

  • survival_time_os and survival_time_pfs are in months.
  • response is encoded as R (responder) or NR (non-responder), following the definitions in PMID: 36055464.

Additional recommended columns include patientid, tissueid, survival_unit, sex, age, histology, and stage.
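The required-column rules above lend themselves to a quick pre-flight check. The sketch below works on rows as returned by `csv.DictReader`. How missing values are encoded is not stated here, so treating empty strings and `NA` as missing is an assumption, and the function name is hypothetical.

```python
REQUIRED_COLUMNS = [
    "cancer_type", "treatment", "response",
    "survival_time_os", "survival_time_pfs",
    "event_occurred_os", "event_occurred_pfs",
]
MISSING = {"", "NA"}  # assumed encodings for missing values


def check_clinical_rows(rows):
    """rows: list of dicts (e.g. from csv.DictReader). Returns a list of problems."""
    if not rows:
        return ["no clinical rows"]
    missing = [c for c in REQUIRED_COLUMNS if c not in rows[0]]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # Each cohort must carry a single cancer type and a single treatment.
    for column in ("cancer_type", "treatment"):
        distinct = {row[column] for row in rows}
        if len(distinct) != 1:
            problems.append(f"{column}: expected one unique value, found {len(distinct)}")
    # Response must be R or NR (or missing).
    bad_response = {row["response"] for row in rows} - {"R", "NR"} - MISSING
    if bad_response:
        problems.append("response: unexpected values " + ", ".join(sorted(bad_response)))
    # Event indicators must be 1 (event) or 0 (censored), or missing.
    for column in ("event_occurred_os", "event_occurred_pfs"):
        bad = {row[column] for row in rows} - {"0", "1"} - MISSING
        if bad:
            problems.append(f"{column}: unexpected values " + ", ".join(sorted(bad)))
    return problems
```

An empty list means the clinical table satisfies the rules listed in this section; it does not check survival-time units, which are expected to be months.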

3.2 Signature-level input (SIG_data/)

  • Contains .rda files storing a data frame named sig

Example signature files:

  • CYT_Rooney.rda
  • EMT_Thompson.rda
  • PredictIO_Bareche.rda

Typical columns in sig:

  • signature_name: Name of the signature
  • gene_name: Name of the gene
  • weight: Weight assigned to each gene

Signature metadata (scoring method, algorithm type) is read from: signature_information.csv.

Signature definitions are sourced from: bhklab/SignatureSets

Full signature metadata (50+ signatures) is available at: bhklab/SignatureSets/tree/main/data-raw

Curation note: All signatures are fully curated and standardized, with gene identifiers, weights, and scoring methods harmonized across cohorts to enable reproducible and comparable signature scoring.

Please follow the same format for consistency.

Step 4: Run the PredictioR Nextflow pipeline

Run the pipeline from the project root.

General usage

# --study applies to se, se_all, and csv_all modes.
# --study_id, --expr_csv, and --clin_csv apply to csv mode only.
nextflow run main.nf -profile standard \
  --input_mode <se|se_all|csv|csv_all> \
  --study <study_id(s)|ALL> \
  --study_id <custom_study_name> \
  --expr_csv <expression_basename> \
  --clin_csv <clinical_basename> \
  --gene <R_gene_vector> \
  --sigs <comma-separated signatures|ALL> \
  --icb_data_dir ./ICB_data \
  --sig_data_dir ./SIG_data \
  --sig_summary_dir ./sig_summary_info \
  --out_dir ./output \
  --run_meta <true|false>

Note: PredictioR-NF can run gene-level analysis (--gene), signature-level analysis (--sigs), or both.
You must provide at least one of --gene or --sigs. Meta-analysis runs only with --run_meta true and ≥ 3 cohorts (multi-cohort input modes only).
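The rules in the note above can be enforced when commands are generated from a script. The following sketch is a hypothetical helper, not part of the pipeline. It covers only the `--study`-based modes, formats the gene list as the R character vector that `--gene` expects, and skips the cohort-count check when `--study ALL` is used because the number of cohorts is not known up front.

```python
import shlex


def r_gene_vector(genes):
    """Format gene symbols as an R character vector, e.g. c("CXCL9","CD8A")."""
    return "c(" + ",".join(f'"{g}"' for g in genes) + ")"


def build_command(input_mode, studies, genes=None, sigs=None, run_meta=False):
    """Return a shell-quoted nextflow command, or raise ValueError on rule violations."""
    if not genes and not sigs:
        raise ValueError("provide at least one of --gene or --sigs")
    if run_meta and studies != ["ALL"] and len(studies) < 3:
        raise ValueError("meta-analysis needs at least 3 cohorts")
    args = ["nextflow", "run", "main.nf", "-profile", "standard",
            "--input_mode", input_mode, "--study", ",".join(studies)]
    if genes:
        args += ["--gene", r_gene_vector(genes)]
    if sigs:
        args += ["--sigs", ",".join(sigs)]
    if run_meta:
        args += ["--run_meta", "true"]
    # shlex.join quotes the R vector so the shell passes it through unchanged.
    return shlex.join(args)
```

For example, `build_command("se", ["ICB_small_Liu"], genes=["CXCL9", "CD8A"])` yields a single-line command with the gene vector wrapped in single quotes.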

Examples

Example 1: SE mode (default), single cohort, gene-only analysis

nextflow run main.nf -profile standard \
  --input_mode se \
  --study ICB_small_Liu \
  --gene 'c("CXCL9","CXCL10","STAT1","CD8A")'

Example 2: SE mode (default), multi-cohort, signatures-only analysis

nextflow run main.nf -profile standard \
  --input_mode se \
  --study ICB_small_Liu,ICB_small_Miao,ICB_small_Van_Allen,ICB_small_Padron \
  --sigs CYT_Rooney,Teff_McDermott \
  --run_meta true

Example 3: SE mode (default), ALL cohorts, gene + ALL signatures, meta-analysis

nextflow run main.nf -profile standard \
  --input_mode se \
  --study ALL \
  --gene 'c("CXCL9","CXCL10","STAT1","CD8A")' \
  --sigs ALL \
  --run_meta true

Example 4: CSV mode, single cohort, gene-only analysis

nextflow run main.nf -profile standard \
  --input_mode csv \
  --study_id ICB_small_Liu \
  --expr_csv ICB_small_Liu_expr \
  --clin_csv ICB_small_Liu_clin \
  --gene 'c("CXCL9")'

Step 5: Review and interpret outputs

All outputs are written to --out_dir (default: ./output).

output/
├── studies/
│   └── <study_id>/
└── meta/

  • Per-cohort outputs (studies/<study_id>/): organized by cohort ID; they include extracted inputs, gene-level association results, signature scores, and signature-level association results.
  • Meta-analysis outputs (meta/): pan-cancer and per-cancer summary tables.
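To get a quick inventory of what a run produced, the output tree can be walked with a few lines of Python. This sketch assumes only the `studies/<study_id>/` and `meta/` layout shown above; the names of individual result files are not listed here, so the helper (a hypothetical name) simply reports whatever files it finds.

```python
from pathlib import Path


def _files_under(directory):
    """Relative POSIX-style paths of all files below directory, sorted."""
    return sorted(
        p.relative_to(directory).as_posix()
        for p in directory.rglob("*") if p.is_file()
    )


def summarize_outputs(out_dir="output"):
    """Return {"studies": {study_id: [files]}, "meta": [files]} for out_dir."""
    root = Path(out_dir)
    summary = {"studies": {}, "meta": []}
    studies_dir = root / "studies"
    if studies_dir.is_dir():
        for study_dir in sorted(studies_dir.iterdir()):
            if study_dir.is_dir():
                summary["studies"][study_dir.name] = _files_under(study_dir)
    meta_dir = root / "meta"
    if meta_dir.is_dir():
        summary["meta"] = _files_under(meta_dir)
    return summary
```

A study that appears with an empty file list, or a missing `meta` entry after a run with `--run_meta true`, is a hint to check the Nextflow log for that step.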

Step 6: Analyses performed

Gene identifier harmonization

  • Acceptable gene identifiers: Ensembl Gene ID, Entrez Gene ID, or HGNC/HUGO gene symbol
  • All expression matrices and signature gene lists will be mapped to a common identifier (one chosen per analysis run)
  • Expression and signatures must be aligned after mapping (consistent gene universe, duplicates handled, unmapped genes flagged/removed)
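The harmonization steps above can be illustrated with a toy example. This is a sketch under stated assumptions, not the pipeline's implementation: the mapping table is supplied by the caller, duplicates that map to the same target identifier are averaged (the actual duplicate-handling rule is not specified here), and unmapped identifiers are returned separately so they can be flagged. The function name is hypothetical.

```python
def harmonize_expression(expression, id_map):
    """Map expression rows to a common identifier.

    expression: {source_id: [value per sample]}
    id_map:     {source_id: target_id}
    Returns (harmonized, unmapped), where harmonized is {target_id: [values]}.
    """
    grouped = {}
    unmapped = []
    for source_id, values in expression.items():
        target = id_map.get(source_id)
        if target is None:
            unmapped.append(source_id)  # flagged and removed
            continue
        grouped.setdefault(target, []).append(values)
    # Assumption: rows sharing a target identifier are averaged per sample.
    harmonized = {
        target: [sum(column) / len(column) for column in zip(*rows)]
        for target, rows in grouped.items()
    }
    return harmonized, unmapped
```

Applying the same mapping to signature gene lists keeps expression and signatures in one gene universe.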

Gene-level analysis

  • Endpoints: overall survival (OS), progression-free survival (PFS), response (R vs NR)
  • OS/PFS: Cox proportional hazards models
  • Response: logistic regression
  • Multiple-testing control: Benjamini–Hochberg FDR
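For readers unfamiliar with the Benjamini-Hochberg adjustment, the procedure can be sketched in a few lines of plain Python. This is an illustration of the standard method, not the pipeline's code, and it does not handle missing (NA) p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value.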

Signature-level analysis

  • Signature scoring: GSVA, ssGSEA, weighted mean, or signature-specific methods
  • Association testing with OS, PFS, and response
  • Multiple-testing control: Benjamini–Hochberg FDR
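Of the scoring methods listed above, the weighted mean is the simplest to show. The sketch below uses the `gene_name` and `weight` columns of a `sig` table. Normalizing by the sum of absolute weights is an assumption made here so that signatures with mixed positive and negative weights stay well defined; the pipeline's exact formula may differ. Genes absent from the expression data are skipped.

```python
def weighted_mean_score(expression, sig_rows):
    """Per-sample weighted mean of signature genes.

    expression: {gene_name: [value per sample]}
    sig_rows:   list of dicts with "gene_name" and "weight" keys
    Returns a list of scores, or None if no signature gene is usable.
    """
    present = [
        (row["gene_name"], float(row["weight"]))
        for row in sig_rows if row["gene_name"] in expression
    ]
    total_weight = sum(abs(weight) for _, weight in present)
    if not present or total_weight == 0:
        return None
    n_samples = len(expression[present[0][0]])
    # Assumption: normalize by the sum of absolute weights.
    return [
        sum(weight * expression[gene][j] for gene, weight in present) / total_weight
        for j in range(n_samples)
    ]
```

A `None` result corresponds to a signature that cannot be scored in a cohort, which is one way NA values can arise downstream.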

Meta-analysis (optional)

  • Random-effects meta-analysis across cohorts
  • Per-cancer meta-analysis when sufficient cohorts/samples are available
  • Performed separately for gene-level and signature-level results
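To make the pooling step concrete, here is a sketch of a random-effects meta-analysis using the DerSimonian-Laird estimator of between-cohort variance. The choice of estimator is an assumption, since the list above does not name one. Inputs are per-cohort effect estimates (for example log hazard ratios or log odds ratios) and their standard errors.

```python
import math


def random_effects_dl(effects, std_errors):
    """DerSimonian-Laird random-effects pooling.

    Returns (pooled_effect, pooled_se, tau2).
    """
    k = len(effects)
    weights = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    # Fixed-effect estimate, needed for Cochran's Q.
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Between-cohort variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    return pooled, math.sqrt(1.0 / sum_re), tau2
```

When the cohorts agree closely, `tau2` is zero and the result reduces to the fixed-effect estimate; heterogeneous cohorts widen the pooled standard error.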

Note: Results may be NA for cohorts/endpoints with missing data, insufficient samples/events, or unmapped genes/signatures (see “Missing values (NA) in outputs”).

Step 7: Reference Resources

Input Data Specifications

ICB Data Information

This table summarizes each dataset by treatment type, cancer type(s), available clinical and molecular data, and the relevant PMID references. The required columns are treatment and cancer type.

| Dataset | Patients [#] | Cancer type | Treatment | Clinical endpoints | Molecular data | PMID |
| --- | --- | --- | --- | --- | --- | --- |
| ICB_small_Hugo | 27 | Melanoma | PD-1/PD-L1 | OS | RNA | 26997480 |
| ICB_small_Liu | 121 | Melanoma | PD-1/PD-L1 | PFS/OS | RNA/DNA | 31792460 |
| ICB_small_Miao | 33 | Kidney | PD-1/PD-L1 | PFS/OS | RNA/DNA | 29301960 |
| ICB_small_Nathanson | 24 | Melanoma | CTLA4 | OS | RNA/DNA | 27956380 |
| ICB_small_Padron | 45 | Pancreas | PD-1/PD-L1 | PFS/OS | RNA | 35662283 |
| ICB_small_Riaz | 46 | Melanoma | PD-1/PD-L1 | OS | RNA/DNA | 29033130 |
| ICB_small_Van_Allen | 42 | Melanoma | CTLA4 | PFS/OS | RNA/DNA | 26359337 |
| ICB_small_Mariathasan | 195 | Bladder | PD-1/PD-L1 | OS | RNA/DNA | 29443960 |
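The table above also indicates which cancer types have enough cohorts for a per-cancer meta-analysis. The sketch below groups the listed datasets by cancer type and applies a minimum of three cohorts. That threshold is the one stated for meta-analysis in general; applying it per cancer type is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# Dataset -> cancer type, copied from the table above.
COHORTS = {
    "ICB_small_Hugo": "Melanoma",
    "ICB_small_Liu": "Melanoma",
    "ICB_small_Miao": "Kidney",
    "ICB_small_Nathanson": "Melanoma",
    "ICB_small_Padron": "Pancreas",
    "ICB_small_Riaz": "Melanoma",
    "ICB_small_Van_Allen": "Melanoma",
    "ICB_small_Mariathasan": "Bladder",
}


def cohorts_per_cancer(cohorts):
    """Group dataset names by cancer type."""
    groups = defaultdict(list)
    for name, cancer_type in cohorts.items():
        groups[cancer_type].append(name)
    return dict(groups)


def eligible_cancers(cohorts, min_cohorts=3):
    """Cancer types with at least min_cohorts datasets (assumed threshold)."""
    groups = cohorts_per_cancer(cohorts)
    return sorted(c for c, names in groups.items() if len(names) >= min_cohorts)
```

With the example datasets, only melanoma has three or more cohorts; all eight cohorts together can still contribute to the pan-cancer analysis.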

Additional Notes

  • Required R packages and dependencies are listed in load_libraries.R and are preinstalled in the BHK Lab Docker image (bhklab/nextflow-env).
  • Customize nextflow.config to set any additional parameters or resource settings your analysis requires.

Contact

For questions or support, contact: nasim.bondarsahebi@uhn.ca, farnoosh.abbasaghababazadeh@uhn.ca
