A modular, reproducible pipeline for Massively Parallel Reporter Assays (MPRA), from raw sequencing reads through count summarization, statistical modeling, and per-condition comparisons. This README provides setup instructions, configuration details, and step-by-step usage examples.
- Project Overview
- Prerequisites
- Repository Layout
- Download and Setup
- Usage: Pipeline Wrapper (pipeline.sh)
- Usage: Step-by-Step Commands
- Matching Oligos to Barcodes (run_match.sh)
- Counting Barcodes (run_count.sh)
- Modeling Activity (run_model.sh)
- Condition Comparisons (run_compare.sh)
- Input/Output File Descriptions
- Troubleshooting
- Collaboration
- License & Citation
This MPRA pipeline processes paired‐end or single‐end FASTQ reads from reporter assays:
- Match reads to reference oligo sequences and extract barcodes.
- Count barcode occurrences across replicates and conditions.
- Model RNA vs. DNA counts to estimate activity (log₂ RNA/DNA) per oligo using DESeq2–based normalization, dispersion estimation, and optional summit shift.
- Compare activity between user‐defined groups (e.g., treatment vs. control) with robust statistical testing.
All steps are implemented as modular scripts, orchestrated by a job‐submission wrapper for HPC clusters (SGE).
- Linux or macOS
- Conda (Miniconda or Anaconda)
- SGE or a compatible job scheduler (optional; the pipeline can also run locally if desired)
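Before setup, a quick read-only check can confirm these prerequisites are available (if you intend to run locally, `qsub` will simply be reported as missing):

```bash
# Read-only checks: report what is available before running setup.sh.
conda --version || echo "conda not found -- install Miniconda or Anaconda first"
command -v qsub >/dev/null 2>&1 && echo "qsub found (SGE submission available)" \
  || echo "qsub not found -- run the pipeline steps locally instead"
```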
MPRA/
├── config/ # all user‐provided configuration
│ ├── acc_id.txt # sample fastq ↔ replicate ID mappings
│ ├── comparisons.tsv # which groups to compare (Comparison, Group1, Group2)
│ └── settings.sh # MUST EDIT: PROJECT_NAME, PROJECT_SUFFIX,
│ CONDA_INIT, SCC_PROJ
├── data/
│ ├── library/ # place your cloned‐library FASTA & controls here
│ └── samples/ # place your raw plasmid/RNA FASTQs here
│
├── logs/ # auto‐generated qsub stdout/err files
│
├── results/ # pipeline outputs by step
│ ├── 01_match/ # reconstructed oligo-barcode mapping
│ ├── 02_count/ # per-replicate count tables & QC
│ ├── 03_model/ # DESeq2 results, normalized counts, plots
│ └── 04_compare/ # pairwise comparison TSVs & summaries
│
├── scripts/ # utility Python scripts (don’t edit)
│ ├── associate_tags.py
│ ├── bc_raw.py
│ ├── compile_bc_cs.py
│ ├── count_qc.py
│ ├── ct_seq.py
│ ├── make_attributes_oligo.py
│ ├── make_counts.py
│ ├── make_infile.py
│ ├── make_project_list.py
│ ├── map_project_annot_fastq.py
│ ├── mapping_qc_plots.py
│ ├── parse_map.py
│ ├── pull_barcodes.py
│ ├── read_stats.py
│ └── sam2mpra_cs.py
│
├── src/ # entrypoints for each pipeline step
│ ├── 01_MPRA_match/
│ │ ├── match.py # core matching logic
│ │ └── run_match.sh # wrapper script
│ ├── 02_MPRA_count/
│ │ ├── count.py # count aggregation logic
│ │ └── run_count.sh # wrapper script
│ ├── 03_MPRA_model/
│ │ ├── model.r # R script doing DESeq2 modeling
│ │ └── run_model.sh # wrapper script
│ └── 04_MPRA_compare/
│ ├── compare.r # R script for group1 vs group2 tests
│ └── run_compare.sh # wrapper script
│
├── .gitignore # ignore logs, results, etc.
├── env.yml # conda environment spec (run `setup.sh`)
├── pipeline.sh # submits each step via qsub (calls run_*.sh)
├── README.md # ← you’re here: this overview and instructions
└── setup.sh # checks conda, prompts for your 4 settings,
creates env, makes sure scripts are executable
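Once the repository is cloned (see the setup steps below), a quick read-only check from the project root can confirm this layout; the list of items is taken from the tree above:

```bash
# Run from the project root; reports any missing top-level items.
for p in config data/library data/samples scripts src env.yml pipeline.sh setup.sh; do
  [ -e "$p" ] && echo "ok: $p" || echo "MISSING: $p"
done
```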
1. Clone the repository
git clone https://github.com/FuxmanBass-lab/MPRA.git
cd MPRA
2. Install the Conda environment
Make sure you have Conda (Miniconda or Anaconda) installed and on your $PATH.
./setup.sh
What setup.sh does:
- Prompts you to confirm (or supply) the following in config/settings.sh:
- PROJECT_NAME: your MPRA project identifier (e.g. OL49)
- PROJECT_SUFFIX: run‐specific tag (e.g. date or batch)
- CONDA_INIT: path to your conda.sh (e.g. ~/miniconda3/etc/profile.d/conda.sh)
- SCC_PROJ: your SCC/cluster project name for qsub (e.g. vcres)
- Auto‐detects and sets $BASE_DIR for you.
- Verifies conda is available.
- Creates the Conda environment from env.yml if it doesn’t already exist.
- Checks that all wrapper scripts (run_match.sh, etc.) and pipeline.sh are executable.
3. Review and customize
Open config/settings.sh in your editor and ensure the four variables above are correctly set for your system (an illustrative example is shown after these steps). Do not modify other lines unless necessary.
4. Verify the layout
Ensure you have placed:
- Library FASTA and FASTQ(s) in data/library/
- Sample FASTQ(s) in data/samples/
- acc_id.txt and comparisons.tsv in config/
5. You are ready to run
Proceed to Usage to start the pipeline steps.
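For reference, after setup the four required variables in config/settings.sh should look roughly like the sketch below; every value shown is an illustrative placeholder, not a default:

```bash
# config/settings.sh -- illustrative values only; adjust for your system.
PROJECT_NAME="OL49"                                   # MPRA project identifier
PROJECT_SUFFIX="2024-06-01"                           # run-specific tag (date or batch)
CONDA_INIT="$HOME/miniconda3/etc/profile.d/conda.sh"  # path to your conda.sh
SCC_PROJ="vcres"                                      # SCC/cluster project name for qsub
# BASE_DIR is auto-detected by setup.sh; normally there is no need to set it by hand.
```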
Once your config/settings.sh is configured, you can launch one or all steps via the high-level wrapper. This will submit each step as a job to the cluster (qsub):
# Run only the matching step:
./pipeline.sh match
# Run only the counting step:
./pipeline.sh count
# Run only the modeling step:
./pipeline.sh model
# Run only the comparison step:
./pipeline.sh compare
# Run all steps in sequence:
./pipeline.sh all
This submits one or more jobs to your SCC cluster with names MPRAmatch, MPRAcount, MPRAmodel, MPRAcompare, or MPRAall. Each job sources your Conda environment and invokes the corresponding run_*.sh script under src/. You can monitor each job’s progress by inspecting the log files in the top-level logs/ directory.
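A minimal sketch for monitoring submitted jobs from the shell, using SGE's `qstat` plus the log files noted above (the exact log file names depend on how the scheduler names them; check `logs/` after submission):

```bash
# List your pending and running jobs on the SGE cluster.
qstat -u "$USER"

# Follow the stdout log of a submitted step as it is written;
# adjust the glob to the step you launched.
tail -f logs/*.out
```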
If you prefer to run each stage manually (or debug a single step), here are the direct commands and their required inputs:
cd src/01_MPRA_match
./run_match.sh
Inputs:
- READ1, READ2 FASTQ files
- Reference oligo fasta
Output:
- Merged .match files, barcode–oligo pairs
- QC plots
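A minimal sketch of running this step by hand inside the pipeline's Conda environment; the conda.sh path and the environment name `MPRA` are assumptions here — use your CONDA_INIT value and the name defined in env.yml:

```bash
# Activate the environment created by setup.sh, then run the matching wrapper directly.
source "$HOME/miniconda3/etc/profile.d/conda.sh"   # or whatever CONDA_INIT points to
conda activate MPRA                                # environment name assumed -- check env.yml

cd src/01_MPRA_match
./run_match.sh 2>&1 | tee ../../logs/manual_match.log   # keep a local copy of the run output
```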
cd ../02_MPRA_count
./run_count.sh
Inputs:
- acc_id.txt (sample ↔ replicate ↔ cell-type ↔ RNA/DNA map)
- Parsed .parsed file from match step
- Raw FASTQ replicates
Output:
- Per-replicate .count tables
- Condition table (<ID_OUT>_condition.txt) for the modeling step
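For reference, config/acc_id.txt (the sample map listed in the inputs above) is a four-column table. The sketch below writes a purely illustrative example to a separate file; the sample names, replicate IDs, and cell types are hypothetical, and tab separation is an assumption, so mirror the format of your own file:

```bash
# Columns: <fastq_filename> <replicate_id> <cell_type> <DNA/RNA> (tab-separated assumed).
# Writes to an example file so a real config/acc_id.txt is never overwritten.
printf '%s\t%s\t%s\t%s\n' \
  plasmid_rep1.fastq.gz plasmid_r1 plasmid DNA \
  HEK293_rep1.fastq.gz  HEK293_r1  HEK293  RNA \
  HEK293_rep2.fastq.gz  HEK293_r2  HEK293  RNA \
  > config/acc_id.txt.example
```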
cd ../03_MPRA_model
./run_model.sh
Inputs:
- Barcode count table (.count)
- Attributes and condition files
- Negative/positive control and experimental tiles/oligos FASTAs
Output:
- Global and per-cell-type normalized counts (*_normalized_counts.tsv)
- Per-cell-type activity (*_activity.tsv)
- DESeq2 results, bed files, session info
- QC and visualizations
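Once the job finishes, a quick shell-level sanity check of the outputs can look like the sketch below; file name patterns follow the output tables later in this README, and the exact <ID_OUT> prefix depends on your project settings:

```bash
# List the model outputs and peek at the per-cell-type activity tables.
ls results/03_model/out/
head -n 5 results/03_model/out/*_activity.tsv

# Normalized count matrices are wide; view them aligned and scroll horizontally.
column -t results/03_model/out/*_normalized_counts.tsv | less -S
```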
cd ../04_MPRA_compare
./run_compare.sh
Inputs:
- Comparison design (config/comparisons.tsv)
- Normalized counts TSV
Output:
- comparison_<Comparison>.tsv (log₂FC, p-values)
- comparison_<Comparison>_with_replicate_activity.tsv
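For reference, config/comparisons.tsv lists one comparison per row (Comparison name, Group1, Group2, with group members comma-separated). The sketch below writes a purely illustrative example to a separate file; the replicate IDs are hypothetical, and the header row and tab separation are assumptions, so mirror the format of your own file:

```bash
# Writes to an example file so a real config/comparisons.tsv is never overwritten.
printf '%s\t%s\t%s\n' \
  Comparison     Group1              Group2 \
  HEK293_vs_K562 HEK293_r1,HEK293_r2 K562_r1,K562_r2 \
  > config/comparisons.tsv.example
```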
Below is a summary of the key files used and generated by each stage of the pipeline. All paths are relative to the project root unless noted.
| File | Purpose |
|---|---|
| `config/settings.sh` | User-editable shell variables (paths, project identifiers) |
| `config/acc_id.txt` | Four-column table: `<fastq_filename> <replicate_id> <cell_type> <DNA/RNA>` |
| `config/comparisons.tsv` | Three columns: Comparison Name, Group1, Group2 (comma-separated) |
| File or Directory | Purpose |
|---|---|
| `data/library/` | Reference FASTA files: oligo templates, plasmid controls, etc. |
| `data/samples/` | Raw FASTQ files for each biological replicate (single/paired end) |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/01_match/*.match` | `run_match.sh` | Interleaved (flashed) reads with pulled oligo/barcode pairs |
| `results/01_match/*.parsed` | helper scripts | Tab-delimited: oligo name, barcode, CIGAR, mapping quality |
| `results/01_match/*_barcode_QC.pdf` | helper scripts | Diagnostic plots of barcode counts per oligo |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/02_count/<ID_OUT>.count` | `run_count.sh` | Matrix of raw barcode counts: rows = oligo/barcode, cols = replicates |
| `results/02_count/*_barcode_QC.pdf` | `count_qc.py` | Per-replicate barcode distribution and QC plots |
| `results/02_count/*_count_QC.pdf` | `count_qc.py` | Per-replicate count distribution and QC plots |
| `results/02_count/*_read_stats.pdf` | `count_qc.py` | Distribution of the number of barcodes and counts across replicates |
| `results/02_count/<ID_OUT>_condition.txt` | `run_count.sh` | Two-column table: `<replicate_id> <condition>` for modeling |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/03_model/out/<ID_OUT>_normalized_counts.tsv` | `model.r` | Size-factor-normalized and summit-shifted count matrix (DNA + RNA) |
| `results/03_model/out/<ID_OUT>_<cell-type>_normalized_counts.tsv` | `model.r` | Per-cell-type size-factor-normalized and summit-shifted count matrix (DNA + RNA) |
| `results/03_model/out/<ID_OUT>_<cell-type>_activity.tsv` | `model.r` | Per-cell-type activity statistics |
| `results/03_model/out/results/*.bed` | `model.r` | BED files of significant oligo hits (optional) |
| `results/03_model/out/sessionInfo.txt` | `run_model.sh` | R session details for reproducibility |
| `results/03_model/out/plots/*_cor.png` | `model.r` | Replicate correlation plots |
| `results/03_model/out/plots/*_logFC_<cell_type>_controls.pdf` | `model.r` | Activity distributions for negative control, positive control, and experimental tiles |
| `results/03_model/out/plots/*_logFC_<cell_type>.pdf` | `model.r` | Activity distribution across all tiles |
| File or Pattern | Generated By | Description |
|---|---|---|
| `results/04_compare/comparison_<Comparison>.tsv` | `run_compare.sh` | DESeq2 output: ID, dna_mean, grp1_mean, grp2_mean, log2FoldChange, pvalue, padj |
| `results/04_compare/comparison_<Comparison>_with_replicate_activity.tsv` | `run_compare.sh` | Above, plus per-replicate log₂(activity) columns |
| File or Directory | Purpose |
|---|---|
| `logs/*.out` / `logs/*.err` | Standard output and error from each `pipeline.sh`-submitted job |
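A quick way to scan these logs for failures after a run (a simple sketch; adjust the patterns to the step you submitted):

```bash
# Flag any log files that mention an error (case-insensitive).
grep -ril "error" logs/ || echo "no errors reported in logs/"

# Inspect the end of the error logs for more detail.
tail -n 30 logs/*.err
```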
If you run into any issues while using this pipeline, try the following:

- Check the scheduler logs
  - Symptom: Jobs appear to finish but downstream steps produce no output or unexpected errors.
  - Solution: Inspect `logs/*.out` and `logs/*.err` for each step's stdout/stderr. They often contain the full error trace or warning messages.

- `conda` or environment errors
  - Symptom: `conda: command not found` or missing packages.
  - Solution: Make sure you have Miniconda or Anaconda installed and that your `CONDA_INIT` path in `config/settings.sh` points to the correct `conda.sh`. Then re-run `setup.sh` to create the environment from `env.yml`.

- Missing or misnamed files
  - Symptom: "ERROR: cannot find file" or "No such file or directory" when running any stage.
  - Solution:
    - Verify that you've populated `config/settings.sh` with correct `BASE_DIR`, `PROJECT_NAME`, `PROJECT_SUFFIX`, `CONDA_INIT`, and `SCC_PROJ`.
    - Ensure your raw data are placed under `data/library/` (reference FASTA) and `data/samples/` (FASTQ).
    - Confirm that `config/acc_id.txt` and `config/comparisons.tsv` exist and are in the expected format.

- Job submission failures on the cluster
  - Symptom: qsub errors, e.g. "Colon not allowed in objectname."
  - Solution:
    - Check that `pipeline.sh`, `run_*.sh`, and all helper scripts are executable (`chmod +x`).
    - Make sure your `SCC_PROJ` matches your cluster project name.

- Unexpected NA values in comparison output
  - Symptom: `NA` in `padj` or `pvalue`.
  - Solution: This may occur if a feature has zero variance across groups or insufficient counts. You can disable independent filtering in `compare.r` (e.g. `results(dds, independentFiltering=FALSE)`), but interpret with caution.

If none of the above resolve your issue, please open an issue on the repository and include:
- The exact command you ran
- A copy of the terminal output or error message
- Contents of `config/settings.sh`, `config/acc_id.txt`, and (if relevant) `config/comparisons.tsv`
We welcome contributions and feedback. To collaborate:
- Report issues: Open a GitHub issue with a clear description and relevant details (error messages, commands, etc.).
- Propose features: Start a discussion or issue outlining your idea before coding.
- Submit pull requests:
- Fork the repo and create a branch.
- Reference related issues in your commits.
- Include tests or examples and update documentation.
- Follow style: Adhere to existing conventions and update README or comments as needed.
All contributions are reviewed—thank you for helping improve the pipeline!
This pipeline is released under the MIT License. See LICENSE for details.
If you use this workflow, please cite the GitHub repository:
“MPRA Pipeline: end-to-end processing, modeling, and comparison of MPRA data,” GitHub, https://github.com/FuxmanBass-lab/MPRA (accessed YYYY-MM-DD).
This pipeline is adapted from the MPRASuite by the Tewhey Lab:
Tewhey Lab MPRASuite MPRAmodel, GitHub, https://github.com/tewhey-lab/MPRASuite/tree/main/MPRAmodel (accessed YYYY-MM-DD).