A modular, reproducible pipeline for Massively Parallel Reporter Assays (MPRA), from raw sequencing reads through count summarization, statistical modeling, and per-condition comparisons. This README provides setup instructions, configuration details, and step-by-step usage examples.
- Project Overview
- Prerequisites
- Repository Layout
- Download and Setup
- Usage: Pipeline Wrapper (pipeline.sh)
- Usage: Step-by-Step Commands
- Matching Oligos to Barcodes (run_match.sh)
- Counting Barcodes (run_count.sh)
- Modeling Activity (run_model.sh)
- Condition Comparisons (run_compare.sh)
- Input/Output File Descriptions
- Troubleshooting
- Collaboration
- License & Citation
This MPRA pipeline processes paired‐end or single‐end FASTQ reads from reporter assays:
- Match reads to reference oligo sequences and extract barcodes.
- Count barcode occurrences across replicates and conditions.
- Model RNA vs. DNA counts to estimate activity (log₂ RNA/DNA) per oligo using DESeq2–based normalization, dispersion estimation, and optional summit shift.
- Compare activity between user‐defined groups (e.g., treatment vs. control) with robust statistical testing.
All steps are implemented as modular scripts, orchestrated by a job‐submission wrapper for HPC clusters (SGE).
- Linux or macOS
- Conda (Miniconda or Anaconda)
- SGE or a compatible job scheduler (optional; the pipeline can also run locally if desired)
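Before setup, a quick read-only check can confirm these prerequisites are available (if you intend to run locally, `qsub` will simply be reported as missing):

```bash
# Read-only checks: report what is available before running setup.sh.
conda --version || echo "conda not found -- install Miniconda or Anaconda first"
command -v qsub >/dev/null 2>&1 && echo "qsub found (SGE submission available)" \
  || echo "qsub not found -- run the pipeline steps locally instead"
```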
MPRA/
├── config/ # all user‐provided configuration
│ ├── acc_id.txt # sample fastq ↔ replicate ID mappings
│ ├── comparisons.tsv # which groups to compare (Comparison, Group1, Group2)
│ └── settings.sh # MUST EDIT: PROJECT_NAME, PROJECT_SUFFIX,
│ CONDA_INIT, SCC_PROJ
├── data/
│ ├── library/ # place your cloned‐library FASTA & controls here
│ └── samples/ # place your raw plasmid/RNA FASTQs here
│
├── logs/ # auto‐generated qsub stdout/err files
│
├── results/ # pipeline outputs by step
│ ├── 01_match/ # reconstructed oligo-barcode mapping
│ ├── 02_count/ # per-replicate count tables & QC
│ ├── 03_model/ # DESeq2 results, normalized counts, plots
│ └── 04_compare/ # pairwise comparison TSVs & summaries
│
├── scripts/ # utility Python scripts (don’t edit)
│ ├── associate_tags.py
│ ├── bc_raw.py
│ ├── compile_bc_cs.py
│ ├── count_qc.py
│ ├── ct_seq.py
│ ├── make_attributes_oligo.py
│ ├── make_counts.py
│ ├── make_infile.py
│ ├── make_project_list.py
│ ├── map_project_annot_fastq.py
│ ├── mapping_qc_plots.py
│ ├── parse_map.py
│ ├── pull_barcodes.py
│ ├── read_stats.py
│ └── sam2mpra_cs.py
│
├── src/ # entrypoints for each pipeline step
│ ├── 01_MPRA_match/
│ │ ├── match.py # core matching logic
│ │ └── run_match.sh # wrapper script
│ ├── 02_MPRA_count/
│ │ ├── count.py # count aggregation logic
│ │ └── run_count.sh # wrapper script
│ ├── 03_MPRA_model/
│ │ ├── model.r # R script doing DESeq2 modeling
│ │ └── run_model.sh # wrapper script
│ └── 04_MPRA_compare/
│ ├── compare.r # R script for group1 vs group2 tests
│ └── run_compare.sh # wrapper script
│
├── .gitignore # ignore logs, results, etc.
├── env.yml # conda environment spec (run `setup.sh`)
├── pipeline.sh # submits each step via qsub (calls run_*.sh)
├── README.md # ← you’re here: this overview and instructions
└── setup.sh # checks conda, prompts for your 4 settings,
creates env, makes sure scripts are executable
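Once the repository is cloned (see the setup steps below), a quick read-only check from the project root can confirm this layout; the list of items is taken from the tree above:

```bash
# Run from the project root; reports any missing top-level items.
for p in config data/library data/samples scripts src env.yml pipeline.sh setup.sh; do
  [ -e "$p" ] && echo "ok: $p" || echo "MISSING: $p"
done
```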
1. Clone the repository
git clone https://github.com/FuxmanBass-lab/MPRA.git
cd MPRA
2. Install the Conda environment
Make sure you have Conda (Miniconda or Anaconda) installed and on your $PATH.
./setup.sh
What setup.sh does:
- Prompts you to confirm (or supply) the following in config/settings.sh:
- PROJECT_NAME: your MPRA project identifier (e.g. OL49)
- PROJECT_SUFFIX: run‐specific tag (e.g. date or batch)
- CONDA_INIT: path to your conda.sh (e.g. ~/miniconda3/etc/profile.d/conda.sh)
- SCC_PROJ: your SCC/cluster project name for qsub (e.g. vcres)
- Auto‐detects and sets $BASE_DIR for you.
- Verifies conda is available.
- Creates the Conda environment from env.yml if it doesn’t already exist.
- Checks that all wrapper scripts (run_match.sh, etc.) and pipeline.sh are executable.
3. Review and customize
Open config/settings.sh in your editor and ensure the four variables above are correctly set for your system (an illustrative example is shown after these steps). Do not modify other lines unless necessary.
4. Verify the layout
Ensure you have placed:
- Library FASTA and FASTQ(s) in data/library/
- Sample FASTQ(s) in data/samples/
- acc_id.txt and comparisons.tsv in config/
5. You are ready to run
Proceed to Usage to start the pipeline steps.
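For reference, after setup the four required variables in config/settings.sh should look roughly like the sketch below; every value shown is an illustrative placeholder, not a default:

```bash
# config/settings.sh -- illustrative values only; adjust for your system.
PROJECT_NAME="OL49"                                   # MPRA project identifier
PROJECT_SUFFIX="2024-06-01"                           # run-specific tag (date or batch)
CONDA_INIT="$HOME/miniconda3/etc/profile.d/conda.sh"  # path to your conda.sh
SCC_PROJ="vcres"                                      # SCC/cluster project name for qsub
# BASE_DIR is auto-detected by setup.sh; normally there is no need to set it by hand.
```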
Once your config/settings.sh is configured, you can launch one or all steps via the high-level wrapper. This will submit each step as a job to the cluster (qsub):
# Run only the matching step:
./pipeline.sh match
# Run only the counting step:
./pipeline.sh count
# Run only the modeling step:
./pipeline.sh model
# Run only the comparison step:
./pipeline.sh compare
# Run all steps in sequence:
./pipeline.sh all
This submits one or more jobs to your SCC cluster with names MPRAmatch, MPRAcount, MPRAmodel, MPRAcompare, or MPRAall. Each job sources your Conda environment and invokes the corresponding run_*.sh script under src/. You can monitor each job’s progress by inspecting the log files in the top-level logs/ directory.
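A minimal sketch for monitoring submitted jobs from the shell, using SGE's `qstat` plus the log files noted above (the exact log file names depend on how the scheduler names them; check `logs/` after submission):

```bash
# List your pending and running jobs on the SGE cluster.
qstat -u "$USER"

# Follow the stdout log of a submitted step as it is written;
# adjust the glob to the step you launched.
tail -f logs/*.out
```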
If you prefer to run each stage manually (or debug a single step), here are the direct commands and their required inputs:
cd src/01_MPRA_match
./run_match.sh
Inputs:
- READ1, READ2 FASTQ files
- Reference oligo fasta
Output:
- Merged .match files, barcode–oligo pairs
- QC plots
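A minimal sketch of running this step by hand inside the pipeline's Conda environment; the conda.sh path and the environment name `MPRA` are assumptions here — use your CONDA_INIT value and the name defined in env.yml:

```bash
# Activate the environment created by setup.sh, then run the matching wrapper directly.
source "$HOME/miniconda3/etc/profile.d/conda.sh"   # or whatever CONDA_INIT points to
conda activate MPRA                                # environment name assumed -- check env.yml

cd src/01_MPRA_match
./run_match.sh 2>&1 | tee ../../logs/manual_match.log   # keep a local copy of the run output
```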
cd ../02_MPRA_count
./run_count.sh
Inputs:
- acc_id.txt (sample ↔ replicate ↔ cell-type ↔ RNA/DNA map)
- Parsed .parsed file from match step
- Raw FASTQ replicates
Output:
- Per-replicate .count tables
- Condition table (<ID_OUT>_condition.txt) for the modeling step
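For reference, config/acc_id.txt (the sample map listed in the inputs above) is a four-column table. The sketch below writes a purely illustrative example to a separate file; the sample names, replicate IDs, and cell types are hypothetical, and tab separation is an assumption, so mirror the format of your own file:

```bash
# Columns: <fastq_filename> <replicate_id> <cell_type> <DNA/RNA> (tab-separated assumed).
# Writes to an example file so a real config/acc_id.txt is never overwritten.
printf '%s\t%s\t%s\t%s\n' \
  plasmid_rep1.fastq.gz plasmid_r1 plasmid DNA \
  HEK293_rep1.fastq.gz  HEK293_r1  HEK293  RNA \
  HEK293_rep2.fastq.gz  HEK293_r2  HEK293  RNA \
  > config/acc_id.txt.example
```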
cd ../03_MPRA_model
./run_model.sh
Inputs:
- Barcode count table (.count)
- Attributes and condition files
- Negative/positive control and experimental tiles/oligos FASTAs
Output:
- Global and per-cell-type normalized counts (*_normalized_counts.tsv)
- Per-cell-type activity (*_activity.tsv)
- DESeq2 results, bed files, session info
- QC and visualizations
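Once the job finishes, a quick shell-level sanity check of the outputs can look like the sketch below; file name patterns follow the output tables later in this README, and the exact <ID_OUT> prefix depends on your project settings:

```bash
# List the model outputs and peek at the per-cell-type activity tables.
ls results/03_model/out/
head -n 5 results/03_model/out/*_activity.tsv

# Normalized count matrices are wide; view them aligned and scroll horizontally.
column -t results/03_model/out/*_normalized_counts.tsv | less -S
```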
cd ../04_MPRA_compare
./run_compare.sh
Inputs:
- Comparison design (config/comparisons.tsv)
- Normalized counts TSV
Output:
- comparison_<Comparison>.tsv (log₂FC, p-values)
- comparison_<Comparison>_with_replicate_activity.tsv
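For reference, config/comparisons.tsv lists one comparison per row (Comparison name, Group1, Group2, with group members comma-separated). The sketch below writes a purely illustrative example to a separate file; the replicate IDs are hypothetical, and the header row and tab separation are assumptions, so mirror the format of your own file:

```bash
# Writes to an example file so a real config/comparisons.tsv is never overwritten.
printf '%s\t%s\t%s\n' \
  Comparison     Group1              Group2 \
  HEK293_vs_K562 HEK293_r1,HEK293_r2 K562_r1,K562_r2 \
  > config/comparisons.tsv.example
```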
Below is a summary of the key files used and generated by each stage of the pipeline. All paths are relative to the project root unless noted.
| File | Purpose |
|---|---|
| `config/settings.sh` | User-editable shell variables (paths, project identifiers) |
| `config/acc_id.txt` | Four-column table: `<fastq_filename> <replicate_id> <cell_type> <DNA/RNA>` |
| `config/comparisons.tsv` | Three columns: Comparison Name, Group1, Group2 (comma-separated) |
| File or Directory | Purpose |
|---|---|
| `data/library/` | Reference FASTA files: oligo templates, plasmid controls, etc. |
| `data/samples/` | Raw FASTQ files for each biological replicate (single/paired end) |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/01_match/*.match` | `run_match.sh` | Interleaved (flashed) reads with pulled oligo/barcode pairs |
| `results/01_match/*.parsed` | helper scripts | Tab-delimited: oligo name, barcode, CIGAR, mapping quality |
| `results/01_match/*_barcode_QC.pdf` | helper scripts | Diagnostic plots of barcode counts per oligo |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/02_count/<ID_OUT>.count` | `run_count.sh` | Matrix of raw barcode counts: rows = oligo/barcode, cols = replicates |
| `results/02_count/*_barcode_QC.pdf` | `count_qc.py` | Per-replicate barcode distribution and QC plots |
| `results/02_count/*_count_QC.pdf` | `count_qc.py` | Per-replicate count distribution and QC plots |
| `results/02_count/*_read_stats.pdf` | `count_qc.py` | Distribution of the number of barcodes and counts across replicates |
| `results/02_count/<ID_OUT>_condition.txt` | `run_count.sh` | Two-column table: `<replicate_id> <condition>` for modeling |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/03_model/out/<ID_OUT>_normalized_counts.tsv` | `model.r` | Size-factor-normalized and summit-shifted count matrix (DNA + RNA) |
| `results/03_model/out/<ID_OUT>_<cell-type>_normalized_counts.tsv` | `model.r` | Per-cell-type size-factor-normalized and summit-shifted count matrix (DNA + RNA) |
| `results/03_model/out/<ID_OUT>_<cell-type>_activity.tsv` | `model.r` | Per-cell-type activity statistics |
| `results/03_model/out/results/*.bed` | `model.r` | BED files of significant oligo hits (optional) |
| `results/03_model/out/sessionInfo.txt` | `run_model.sh` | R session details for reproducibility |
| `results/03_model/out/plots/*_cor.png` | `model.r` | Replicate correlation plots |
| `results/03_model/out/plots/*_logFC_<cell_type>_controls.pdf` | `model.r` | Activity distributions for negative control, positive control, and experimental tiles |
| `results/03_model/out/plots/*_logFC_<cell_type>.pdf` | `model.r` | Activity distribution across all tiles |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/04_compare/comparison_<Comparison>.tsv` | `run_compare.sh` | DESeq2 output: ID, dna_mean, grp1_mean, grp2_mean, log2FoldChange, pvalue, padj |
| `results/04_compare/comparison_<Comparison>_with_replicate_activity.tsv` | `run_compare.sh` | Above, plus per-replicate log₂(activity) columns |
| File or Directory | Purpose |
|---|---|
| `logs/*.out` / `logs/*.err` | Standard output and error from each `pipeline.sh`-submitted job |
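A quick way to scan these logs for failures after a run (a simple sketch; adjust the patterns to the step you submitted):

```bash
# Flag any log files that mention an error (case-insensitive).
grep -ril "error" logs/ || echo "no errors reported in logs/"

# Inspect the end of the error logs for more detail.
tail -n 30 logs/*.err
```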
If you run into any issues while using this pipeline, try the following:

- Check the scheduler logs
  - Symptom: Jobs appear to finish but downstream steps produce no output or unexpected errors.
  - Solution: Inspect `logs/*.out` and `logs/*.err` for each step's stdout/stderr. They often contain the full error trace or warning messages.

- `conda` or environment errors
  - Symptom: `conda: command not found` or missing packages.
  - Solution: Make sure you have Miniconda or Anaconda installed and that your `CONDA_INIT` path in `config/settings.sh` points to the correct `conda.sh`. Then re-run `setup.sh` to create the environment from `env.yml`.

- Missing or misnamed files
  - Symptom: "ERROR: cannot find file" or "No such file or directory" when running any stage.
  - Solution:
    - Verify that you've populated `config/settings.sh` with correct `BASE_DIR`, `PROJECT_NAME`, `PROJECT_SUFFIX`, `CONDA_INIT`, and `SCC_PROJ`.
    - Ensure your raw data are placed under `data/library/` (reference FASTA) and `data/samples/` (FASTQ).
    - Confirm that `config/acc_id.txt` and `config/comparisons.tsv` exist and are in the expected format.

- Job submission failures on the cluster
  - Symptom: qsub errors, e.g. "Colon not allowed in objectname."
  - Solution:
    - Check that `pipeline.sh`, `run_*.sh`, and all helper scripts are executable (`chmod +x`).
    - Make sure your `SCC_PROJ` matches your cluster project name.

- Unexpected NA values in comparison output
  - Symptom: `NA` in `padj` or `pvalue`.
  - Solution: This may occur if a feature has zero variance across groups or insufficient counts. You can disable independent filtering in `compare.r` (e.g. `results(dds, independentFiltering=FALSE)`), but interpret with caution.

If none of the above resolve your issue, please open an issue on the repository and include:
- The exact command you ran
- A copy of the terminal output or error message
- Contents of `config/settings.sh`, `config/acc_id.txt`, and (if relevant) `config/comparisons.tsv`
We welcome contributions and feedback. To collaborate:
- Report issues: Open a GitHub issue with a clear description and relevant details (error messages, commands, etc.).
- Propose features: Start a discussion or issue outlining your idea before coding.
- Submit pull requests:
- Fork the repo and create a branch.
- Reference related issues in your commits.
- Include tests or examples and update documentation.
- Follow style: Adhere to existing conventions and update README or comments as needed.
All contributions are reviewed—thank you for helping improve the pipeline!
This pipeline is released under the MIT License. See LICENSE for details.
If you use this workflow, please cite the GitHub repository:
“MPRA Pipeline: end-to-end processing, modeling, and comparison of MPRA data,” GitHub, https://github.com/FuxmanBass-lab/MPRA (accessed YYYY-MM-DD).
This pipeline is adapted from the MPRASuite by the Tewhey Lab:
Tewhey Lab MPRASuite MPRAmodel, GitHub, https://github.com/tewhey-lab/MPRASuite/tree/main/MPRAmodel (accessed YYYY-MM-DD).