PredictioR Nextflow Pipeline

Overview

The PredictioR Nextflow (PredictioR-NF) pipeline is a scalable, end-to-end workflow for immunotherapy biomarker discovery across multiple cancer cohorts. It is implemented in Nextflow and runs in Docker for reproducible and portable analyses.

PredictioR-NF accepts input data as Bioconductor SummarizedExperiment (.rda, recommended) or paired expression and clinical CSV files. For each cohort, it performs gene-level and gene-signature association testing, and can optionally aggregate results across cohorts using pan-cancer and cancer-specific meta-analysis.

The main workflow (main.nf) consists of three sequential analysis stages:

Gene-level analysis
Signature-level analysis
Meta-analysis (optional)

Quickstart

Step 1: Install Nextflow and Docker
Step 2: Project structure
Step 3: Prepare input data
Step 4: Run the PredictioR Nextflow pipeline
Step 5: Review and interpret outputs
Step 6: Analyses performed
Step 7: Reference resources

Step 1: Install Nextflow and Docker

Before you start (recommended environment)

Linux/macOS: supported.
Windows: use WSL2 (Ubuntu) + Docker Desktop (recommended).
Avoid on Windows: Git Bash / MINGW64 (can cause Nextflow terminal/signal issues).

Requirements: Java ≥ 11 (Java 17 recommended; tested with 17.0.x), Nextflow ≥ 24 (tested with 25.10.2), Docker Engine/Docker Desktop ≥ 20.10 (tested with 27.5.1).

Sanity checks:

java -version
nextflow -version
docker version
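The minimum versions listed above can also be compared mechanically. The sketch below is a hypothetical helper (`version_ge` is not part of PredictioR-NF) written in POSIX shell; it assumes purely numeric dotted versions such as 17.0.9 and does not handle suffixes like `-beta`.

```shell
# Hypothetical helper: succeeds when dotted numeric version $1 is >= $2.
# Missing fields count as 0, so 24 and 24.0 compare as equal.
version_ge() {
  v1=$1
  v2=$2
  while [ -n "$v1" ] || [ -n "$v2" ]; do
    f1=${v1%%.*}   # leading field of each version
    f2=${v2%%.*}
    [ "${f1:-0}" -gt "${f2:-0}" ] && return 0
    [ "${f1:-0}" -lt "${f2:-0}" ] && return 1
    # Drop the field just compared; stop when no dot remains.
    case $v1 in *.*) v1=${v1#*.} ;; *) v1= ;; esac
    case $v2 in *.*) v2=${v2#*.} ;; *) v2= ;; esac
  done
  return 0
}
```

For example, `version_ge "$(docker version --format '{{.Client.Version}}')" 20.10` should succeed on a Docker client at or above the stated minimum. Extracting the version number from `java -version` or `nextflow -version` is left out here because their output formats vary.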

Nextflow

Setup: https://www.nextflow.io/docs/latest/install.html
Documentation: https://www.nextflow.io/docs/latest/index.html
Training: https://training.nextflow.io

Docker

Purpose: Ensures reproducible execution by containerizing the full runtime environment
Install: https://docs.docker.com/get-docker/
PredictioR Docker image: bhklab/nextflow-env
Docker Hub: https://hub.docker.com/r/bhklab/nextflow-env

Pull the image:

docker pull bhklab/nextflow-env

Step 2: Project structure

Before running the pipeline, the project directory should contain:

.
├── main.nf                 # Nextflow workflow (gene + signature association + meta-analysis)
├── nextflow.config         # Profiles/resources + Docker settings
├── ICB_data/               # Cohort inputs: *.rda (SE mode) or *_expr.csv + *_clin.csv (CSV mode)
├── SIG_data/               # Signature .rda files (each loads a `sig` data frame)
├── sig_summary_info/       # Signature metadata (signature_information.csv)
└── output/                 # Results (auto-created): studies/<study_id>/ and meta/
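Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not something shipped with the pipeline: `check_layout` is a hypothetical function name, and it checks only the paths named in the tree (output/ is skipped because it is auto-created).

```shell
# Hypothetical pre-run check: reports each required path missing under $1
# (default: current directory) and returns non-zero if any is absent.
check_layout() {
  root=${1:-.}
  missing=0
  for name in main.nf nextflow.config; do
    [ -f "$root/$name" ] || { echo "missing file: $name" >&2; missing=1; }
  done
  for name in ICB_data SIG_data sig_summary_info; do
    [ -d "$root/$name" ] || { echo "missing directory: $name" >&2; missing=1; }
  done
  return "$missing"
}
```

Running `check_layout .` from the project root prints nothing and succeeds when every required file and directory exists.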

Step 3: Prepare input data

Each cohort (in either SummarizedExperiment or CSV mode) is expected to represent a single cancer type and a single treatment category.

FAIR data note: PredictioR-NF assumes standardized, well-annotated inputs to enable reproducible analyses and reuse across cohorts. We recommend SummarizedExperiment to keep molecular assays, sample metadata, and feature annotations together, with consistent sample IDs and harmonized clinical endpoint variables.

Curation standards: Clinical variables and genomic metadata were curated and harmonized using mCODE concepts where applicable, and aligned with ICGC/ICGC-ARGO conventions (e.g., consistent variable naming, controlled vocabularies, and cohort metadata structure).

3.1 Gene-level input (`ICB_data/`)

3.1.1 SummarizedExperiment mode (default; recommended)

Input: Bioconductor SummarizedExperiment objects stored as .rda files
These objects enable standardized handling of:
- Gene expression data
- Clinical annotations
- Immunotherapy outcome variables

Example input files:

ICB_small_Hugo.rda
ICB_small_Mariathasan.rda

Example datasets directory: bhklab/PredictioR/tree/main/data

SummarizedExperiment documentation: SummarizedExperiment.html

3.1.2 CSV mode

CSV mode enables analysis of custom cohorts without requiring a SummarizedExperiment.

Expression CSV

Genes × samples matrix
Rows = gene identifiers
Columns = sample IDs

Clinical CSV

One row per sample
Sample identifiers must align to expression column names

Mandatory sample-matching requirement: Expression column names must exactly match clinical sample identifiers (order does not need to match).
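The sample-matching requirement can be checked before a run. The sketch below makes several assumptions that the text above does not state: both files are simple comma-separated files without embedded commas, the first expression column holds gene identifiers, and the first clinical column holds the sample identifiers. `check_sample_ids` is a hypothetical name.

```shell
# Hypothetical check: succeeds when the expression header's sample IDs and the
# clinical file's first-column IDs are the same set (order is ignored).
# $1 = expression CSV, $2 = clinical CSV.
check_sample_ids() {
  tmp=$(mktemp -d) || return 2
  # Expression side: header fields, minus the leading gene-ID column.
  head -n 1 "$1" | tr -d '"\r' | tr ',' '\n' | tail -n +2 | sort -u > "$tmp/expr"
  # Clinical side: first column of every data row.
  tail -n +2 "$2" | tr -d '"\r' | cut -d, -f1 | sort -u > "$tmp/clin"
  if cmp -s "$tmp/expr" "$tmp/clin"; then
    rc=0
  else
    # Print IDs that appear on one side only (left column: expression only).
    comm -3 "$tmp/expr" "$tmp/clin" >&2
    rc=1
  fi
  rm -rf "$tmp"
  return "$rc"
}
```

If the sample identifiers live in a different clinical column, the `cut -d, -f1` field would need to change accordingly.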

Required clinical columns

Column	Description
`cancer_type`	Cancer type (single unique value per cohort)
`treatment`	Treatment type (single unique value per cohort)
`response`	Response (`R` / `NR`)
`survival_time_os`	Overall survival time
`survival_time_pfs`	Progression-free survival time
`event_occurred_os`	OS event indicator (1 = event, 0 = censored)
`event_occurred_pfs`	PFS event indicator (1 = event, 0 = censored)
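A quick header check can catch a missing required column before the pipeline starts. This is a minimal sketch under the assumption that the clinical CSV has a plain comma-separated header row; `check_clin_columns` is a hypothetical helper, not a pipeline command.

```shell
# Hypothetical check: prints each required clinical column that is absent
# from the header of $1 and returns non-zero if any is missing.
check_clin_columns() {
  header=$(head -n 1 "$1" | tr -d '"\r')
  rc=0
  for col in cancer_type treatment response survival_time_os survival_time_pfs \
             event_occurred_os event_occurred_pfs; do
    case ",$header," in
      *",$col,"*) ;;                              # column present
      *) echo "missing column: $col"; rc=1 ;;
    esac
  done
  return "$rc"
}
```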

Endpoints and definitions

survival_time_os and survival_time_pfs are in months.
response is encoded as R (responder) vs NR (non-responder), following the definitions in PMID: 36055464.

Additional recommended columns include patientid, tissueid, survival_unit, sex, age, histology, and stage.
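Two of the constraints above are easy to inspect by listing the distinct values of a column: `cancer_type` and `treatment` should each yield one value per cohort, and `response` should yield only R and NR. The sketch below assumes a simple comma-separated clinical file without embedded commas; `distinct_values` is a hypothetical helper.

```shell
# Hypothetical helper: prints the sorted distinct values of column $2 in CSV $1.
distinct_values() {
  awk -F, -v col="$2" '
    { gsub(/"|\r/, "") }                      # strip quotes and carriage returns
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c { print $c }                            # only when the column was found
  ' "$1" | sort -u
}
```

For instance, `distinct_values ICB_data/ICB_small_Liu_clin.csv cancer_type` printing more than one line would indicate a cohort that mixes cancer types; the file name here follows the CSV-mode pattern and is used only as an illustration.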

3.2 Signature-level input (`SIG_data/`)

Contains .rda files storing a data frame named sig

Example signature files:

CYT_Rooney.rda
EMT_Thompson.rda
PredictIO_Bareche.rda

Typical columns in sig:

signature_name: Name of the signature
gene_name: Name of the gene
weight: Weight assigned to each gene

Signature metadata (scoring method, algorithm type) is read from: signature_information.csv.

Signature definitions are sourced from: bhklab/SignatureSets

Full signature metadata (50+ signatures) is available at: bhklab/SignatureSets/tree/main/data-raw

Curation note: All signatures are fully curated and standardized, with gene identifiers, weights, and scoring methods harmonized across cohorts to enable reproducible and comparable signature scoring.

When adding custom signatures, follow the same format so that they are scored consistently with the curated ones.

Step 4: Run the PredictioR Nextflow pipeline

Run the pipeline from the project root.

General usage

# Synopsis: where values are separated by "|", choose one.
# --study applies to the se, se_all, and csv_all input modes.
# --study_id, --expr_csv, and --clin_csv apply to csv mode only.
nextflow run main.nf -profile standard \
  --input_mode se|se_all|csv|csv_all \
  --study <study_id(s)|ALL> \
  --study_id <custom_study_name> \
  --expr_csv <expression_basename> \
  --clin_csv <clinical_basename> \
  --gene <R_gene_vector> \
  --sigs <comma-separated signatures|ALL> \
  --icb_data_dir ./ICB_data \
  --sig_data_dir ./SIG_data \
  --sig_summary_dir ./sig_summary_info \
  --out_dir ./output \
  --run_meta true|false

Note: PredictioR-NF can run gene-level analysis (--gene), signature-level analysis (--sigs), or both.
You must provide at least one of --gene or --sigs. Meta-analysis runs only with --run_meta true and ≥ 3 cohorts (multi-cohort input modes only).
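The two rules in the note above can be enforced in a wrapper script before Nextflow is invoked. The following is a sketch under stated assumptions: `preflight` is a hypothetical function, it counts cohorts only when an explicit comma-separated list is given, and it cannot know how many cohorts `ALL` expands to, so that case is passed through.

```shell
# Hypothetical pre-launch check.
# $1 = --gene value, $2 = --sigs value, $3 = --run_meta value,
# $4 = comma-separated --study list (not counted when it is ALL).
preflight() {
  if [ -z "$1" ] && [ -z "$2" ]; then
    echo "provide at least one of --gene or --sigs" >&2
    return 1
  fi
  if [ "$3" = true ] && [ "$4" != ALL ]; then
    n=$(printf '%s\n' "$4" | awk -F, '{ print NF }')   # number of listed cohorts
    if [ "$n" -lt 3 ]; then
      echo "meta-analysis needs at least 3 cohorts, got $n" >&2
      return 1
    fi
  fi
  return 0
}
```

A wrapper might call `preflight "$GENE" "$SIGS" "$RUN_META" "$STUDY" || exit 1` before the `nextflow run` line, where the four variable names are placeholders chosen for this example.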

Examples

Example 1: SE mode (default), single cohort, gene-only analysis

nextflow run main.nf -profile standard \
  --input_mode se \
  --study ICB_small_Liu \
  --gene 'c("CXCL9","CXCL10","STAT1","CD8A")'

Example 2: SE mode (default), multi-cohort, signatures-only analysis

nextflow run main.nf -profile standard \
  --input_mode se \
  --study ICB_small_Liu,ICB_small_Miao,ICB_small_Van_Allen,ICB_small_Padron \
  --sigs CYT_Rooney,Teff_McDermott \
  --run_meta true

Example 3: SE mode (default), ALL cohorts, gene + ALL signatures, meta-analysis

nextflow run main.nf -profile standard \
  --input_mode se \
  --study ALL \
  --gene 'c("CXCL9","CXCL10","STAT1","CD8A")' \
  --sigs ALL \
  --run_meta true

Example 4: CSV mode, single cohort, gene-only analysis

nextflow run main.nf -profile standard \
  --input_mode csv \
  --study_id ICB_small_Liu \
  --expr_csv ICB_small_Liu_expr \
  --clin_csv ICB_small_Liu_clin \
  --gene 'c("CXCL9")'
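The `--gene` values in the examples are R character vector literals, which are tedious to quote by hand. As a convenience, the sketch below builds one from plain arguments; `r_gene_vector` is a hypothetical helper and assumes gene names contain no quotes or commas.

```shell
# Hypothetical helper: prints an R character vector literal from its arguments,
# e.g. r_gene_vector CXCL9 CXCL10 gives c("CXCL9","CXCL10").
r_gene_vector() {
  out=
  for g in "$@"; do
    out="${out:+$out,}\"$g\""   # append a comma only after the first element
  done
  printf 'c(%s)\n' "$out"
}
```

It could then be used as `--gene "$(r_gene_vector CXCL9 CXCL10 STAT1 CD8A)"` in any of the commands above.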

Step 5: Review and interpret outputs

All outputs are written to --out_dir (default: ./output).

output/
├── studies/
│   └── <study_id>/ 
└── meta/

Per-cohort outputs: organized by cohort ID under studies/<study_id>/; these include extracted inputs, gene-level association results, signature scores, and signature-level association results.
Meta-analysis outputs: written under meta/; these include pan-cancer and per-cancer summary tables.
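After a run, the cohorts that produced results can be listed from the directory layout alone. This sketch relies only on the studies/<study_id>/ structure shown above; `list_studies` is a hypothetical helper, and the file names inside each study directory are not assumed.

```shell
# Hypothetical helper: prints one study ID per line for the output directory
# given as $1 (default: ./output). Prints nothing if no studies exist.
list_studies() {
  for dir in "${1:-./output}"/studies/*/; do
    [ -d "$dir" ] || continue   # skip the unexpanded glob when nothing matches
    basename "$dir"
  done
}
```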

Step 6: Analyses performed

Gene identifier harmonization

Acceptable gene identifiers: Ensembl Gene ID, Entrez Gene ID, or HGNC/HUGO gene symbol
All expression matrices and signature gene lists will be mapped to a common identifier (one chosen per analysis run)
Expression and signatures must be aligned after mapping (consistent gene universe, duplicates handled, unmapped genes flagged/removed)

Gene-level analysis

Endpoints: overall survival (OS), progression-free survival (PFS), response (R vs NR)
OS/PFS: Cox proportional hazards models
Response: logistic regression
Multiple-testing control: Benjamini–Hochberg FDR
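To make the Benjamini–Hochberg adjustment concrete, the sketch below computes adjusted p-values for a small list. It is an illustration of the procedure, not the pipeline's own code (which runs in R); `bh_adjust` is a hypothetical name, and the use of `sort -g` assumes a sort implementation that supports general numeric ordering (GNU and BSD sort do).

```shell
# Illustrative BH adjustment: reads one p-value per line on stdin and prints
# the adjusted values in the original input order.
bh_adjust() {
  awk '{ print NR, $1 }' | LC_ALL=C sort -k2,2g | awk '
    { idx[NR] = $1; p[NR] = $2 }          # rank NR holds input position and p
    END {
      m = NR; prev = 1
      for (r = m; r >= 1; r--) {          # walk from the largest p-value down
        q = p[r] * m / r
        if (q < prev) prev = q            # cumulative minimum keeps values monotone
        adj[idx[r]] = prev
      }
      for (i = 1; i <= m; i++) printf "%.4g\n", adj[i]
    }'
}
```

For the inputs 0.01, 0.04, 0.03, 0.005 this prints 0.02, 0.04, 0.04, 0.02, one per line.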

Signature-level analysis

Signature scoring: GSVA, ssGSEA, weighted mean, or signature-specific methods
Association testing with OS, PFS, and response
Multiple-testing control: Benjamini–Hochberg FDR

Meta-analysis (optional)

Random-effects meta-analysis across cohorts
Per-cancer meta-analysis when sufficient cohorts/samples are available
Performed separately for gene-level and signature-level results
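As an illustration of how a random-effects pooled estimate is formed, the sketch below implements the DerSimonian-Laird estimator, one common choice. This is an assumption for illustration: the text above does not state which estimator the pipeline uses, and `dl_meta` is a hypothetical name. Each input line is one cohort's effect (for example a log hazard ratio) and its standard error.

```shell
# Illustrative DerSimonian-Laird random-effects pooling.
# stdin:  "<effect> <standard_error>" per cohort.
# stdout: pooled effect, its standard error, and tau^2.
dl_meta() {
  awk '
    { y[NR] = $1; v[NR] = $2 * $2 }
    END {
      k = NR
      # Fixed-effect (inverse-variance) estimate and Cochran Q.
      for (i = 1; i <= k; i++) { w = 1 / v[i]; sw += w; swy += w * y[i]; sw2 += w * w }
      fixed = swy / sw
      for (i = 1; i <= k; i++) q += (y[i] - fixed) ^ 2 / v[i]
      # Between-cohort variance, truncated at zero.
      c = sw - sw2 / sw
      tau2 = (c > 0) ? (q - (k - 1)) / c : 0
      if (tau2 < 0) tau2 = 0
      # Random-effects weights add tau^2 to each within-cohort variance.
      for (i = 1; i <= k; i++) { w = 1 / (v[i] + tau2); rsw += w; rswy += w * y[i] }
      printf "%.4f %.4f %.4f\n", rswy / rsw, sqrt(1 / rsw), tau2
    }'
}
```

With three identical cohorts (effect 0.5, standard error 0.2) there is no heterogeneity, so tau^2 is 0 and the output is `0.5000 0.1155 0.0000`.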

Note: Results may be NA for cohorts/endpoints with missing data, insufficient samples/events, or unmapped genes/signatures (see “Missing values (NA) in outputs”).

Step 7: Reference resources

GitHub repository: https://github.com/bhklab/PredictioR
Associated publication: Leveraging big data of immune checkpoint blockade response identifies novel potential targets

Input Data Specifications

ICB Data Information

This table summarizes each dataset by treatment type, cancer type(s), available clinical and molecular data, and the relevant PMID references. The required columns are treatment and cancer type.

Dataset	Patients [#]	Cancer type	Treatment	Clinical endpoints	Molecular data	PMID
ICB_small_Hugo	27	Melanoma	PD-1/PD-L1	OS	RNA	26997480
ICB_small_Liu	121	Melanoma	PD-1/PD-L1	PFS/OS	RNA/DNA	31792460
ICB_small_Miao	33	Kidney	PD-1/PD-L1	PFS/OS	RNA/DNA	29301960
ICB_small_Nathanson	24	Melanoma	CTLA4	OS	RNA/DNA	27956380
ICB_small_Padron	45	Pancreas	PD-1/PD-L1	PFS/OS	RNA	35662283
ICB_small_Riaz	46	Melanoma	PD-1/PD-L1	OS	RNA/DNA	29033130
ICB_small_Van_Allen	42	Melanoma	CTLA4	PFS/OS	RNA/DNA	26359337
ICB_small_Mariathasan	195	Bladder	PD-1/PD-L1	OS	RNA/DNA	29443960

Additional Notes

Required R packages and dependencies are installed as specified in load_libraries.R and are included in the Docker image (bhklab/nextflow-env).
Customize nextflow.config to set any additional parameters or resource settings your analysis requires.

Contact

For questions or support, contact: nasim.bondarsahebi@uhn.ca, farnoosh.abbasaghababazadeh@uhn.ca
