RNA-seq Analysis Pipeline (Nextflow + Salmon)

Overview

This project is an end-to-end, reproducible RNA-seq analysis pipeline built using Nextflow (DSL2). It automates the process from raw sequencing data retrieval to transcript-level quantification, with a strong emphasis on portability, reproducibility, and scalability.

The pipeline supports both single-end and paired-end RNA-seq data, and can be executed locally, on HPC systems, or in the cloud using containerized environments.

NOTE: This pipeline is demonstrated using publicly available RNA-seq data from ENA. No controlled-access or sensitive data are included.

Key Features

Automated data retrieval from the European Nucleotide Archive (ENA)
Quality control using FastQC and MultiQC
Transcript-level quantification with Salmon
Robust handling of single-end and paired-end libraries
Container support (Docker or Singularity/Apptainer) for reproducible execution
Modular DSL2 design for easy extension and maintenance

Pipeline Workflow

Input metadata parsing (ENA accessions, library layout, sample info)
FASTQ download from ENA
Quality control with FastQC
QC aggregation with MultiQC
Transcript quantification using Salmon
Structured outputs for downstream analysis (e.g., tximport / DESeq2)
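
As a rough illustration of the FASTQ download step, the sketch below builds a query against the ENA Portal API file report and turns its fastq_ftp column into download URLs. It is not taken from the pipeline source: the function names are hypothetical, and the choice of requested fields and the ftp:// prefix are assumptions about how one might resolve a run accession to files.

```python
import csv
import io
from urllib.parse import urlencode

# ENA Portal API endpoint that reports the files attached to a run.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(run_accession):
    """Build the file-report query URL for one run accession (hypothetical helper)."""
    query = urlencode({
        "accession": run_accession,
        "result": "read_run",
        "fields": "run_accession,library_layout,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession in a file-report TSV to its list of FASTQ URLs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    urls = {}
    for row in reader:
        # fastq_ftp holds semicolon-separated paths without a URL scheme.
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        urls[row["run_accession"]] = [f"ftp://{p}" for p in paths]
    return urls


if __name__ == "__main__":
    # Prints the query URL only; fetching it is left to the caller.
    print(filereport_url("ERR1234567"))
```

A paired-end run would be expected to yield two URLs and a single-end run one, which gives a cheap cross-check against the library_layout declared in the sample sheet.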

Requirements

Software

Nextflow (version 22 or later)
Docker or Singularity/Apptainer

Containers

All tools are run inside containers:

FastQC
MultiQC
Salmon

No local installation of bioinformatics tools is required beyond Nextflow and a container engine.

Input

Sample Metadata

The pipeline expects a sample sheet (CSV or TSV) containing, at minimum:

sample_id
ena_run_accession
library_layout (SINGLE or PAIRED)

Example:

sample_id,ena_run_accession,library_layout
sampleA,ERR1234567,PAIRED
sampleB,ERR1234568,SINGLE
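
A sample sheet in this format can be checked before a run is launched. The following is a minimal sketch of such a check, not the pipeline's own parsing code: the function name read_sample_sheet and the specific rules (non-empty fields, unique sample_id, layout restricted to SINGLE or PAIRED) are assumptions based on the column descriptions above.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "ena_run_accession", "library_layout")
VALID_LAYOUTS = {"SINGLE", "PAIRED"}


def read_sample_sheet(text, delimiter=","):
    """Parse sample-sheet text and return one validated dict per sample."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {', '.join(missing)}")

    samples = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Short rows yield None for absent fields, so fall back to "".
        sample = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        sample["library_layout"] = sample["library_layout"].upper()
        if not sample["sample_id"] or not sample["ena_run_accession"]:
            raise ValueError(f"line {line_no}: empty sample_id or ena_run_accession")
        if sample["library_layout"] not in VALID_LAYOUTS:
            raise ValueError(f"line {line_no}: library_layout must be SINGLE or PAIRED")
        if sample["sample_id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate sample_id {sample['sample_id']!r}")
        seen_ids.add(sample["sample_id"])
        samples.append(sample)
    return samples


# The example sheet from this README.
EXAMPLE = """sample_id,ena_run_accession,library_layout
sampleA,ERR1234567,PAIRED
sampleB,ERR1234568,SINGLE
"""

if __name__ == "__main__":
    for sample in read_sample_sheet(EXAMPLE):
        print(sample)
```

For a TSV sheet, pass delimiter="\t". Failing early on a malformed row is cheaper than discovering it after the downloads have started.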

Reference Files

Transcriptome FASTA
Salmon index (generated beforehand or as a pipeline extension)

Usage

Run Locally with Docker

nextflow run main.nf \
  --samples samples.csv \
  --transcriptome transcripts.fa \
  -profile docker

Run on HPC (example)

nextflow run main.nf -profile slurm,singularity

Outputs

results/
├── fastqc/          # Per-sample FastQC reports
├── multiqc/         # Aggregated QC report
├── salmon/          # Transcript-level quantification
│   └── quant.sf
└── logs/

Key output file:

quant.sf – transcript-level abundance estimates suitable for tximport and downstream differential expression analysis.
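
For a quick look at a quant.sf file outside R, the sketch below reads the tab-separated table using Salmon's standard column names (Name, Length, EffectiveLength, TPM, NumReads). The helper names and the 1.0 TPM cut-off are illustrative assumptions, and the rows in the demo are toy values, not pipeline output.

```python
import csv
import io


def read_quant_sf(handle):
    """Return {transcript_name: (tpm, num_reads)} from an open quant.sf file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: (float(row["TPM"]), float(row["NumReads"])) for row in reader}


def expressed_transcripts(quant, min_tpm=1.0):
    """List transcript names with TPM at or above an arbitrary threshold."""
    return sorted(name for name, (tpm, _) in quant.items() if tpm >= min_tpm)


if __name__ == "__main__":
    # Toy table with invented values, used only to exercise the functions.
    toy = (
        "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
        "tx1\t1000\t800.0\t12.5\t40.0\n"
        "tx2\t500\t300.0\t0.0\t0.0\n"
    )
    quant = read_quant_sf(io.StringIO(toy))
    print(expressed_transcripts(quant))
```

With real output, replace the toy table with an open file handle, for example open("results/salmon/quant.sf"). For differential expression, tximport remains the appropriate route, since it handles gene-level summarization and length offsets.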

Design Considerations

Single-end vs Paired-end Handling

The pipeline explicitly handles differences between library layouts:

Paired-end: _1 and _2 FASTQs passed to Salmon
Single-end: only the _1 FASTQ is passed to Salmon; read 2 is optional and ignored if absent

Handling the layout explicitly prevents downstream failures caused by missing or empty read 2 files.
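
The layout handling described here can be sketched as a function that assembles the salmon quant arguments. This is an illustration under stated assumptions, not the pipeline's actual process definition: the function name, the automatic library type (-l A), and the per-sample output directory are all hypothetical choices. The -1/-2 and -r flags are Salmon's standard options for paired and unpaired reads.

```python
from pathlib import PurePosixPath


def salmon_quant_args(sample_id, layout, index, reads, outdir="results/salmon"):
    """Assemble a `salmon quant` argument list for one sample (hypothetical helper)."""
    reads = [str(r) for r in reads if r]  # drop None or empty placeholders
    layout = layout.upper()
    # -l A asks Salmon to infer the library type (an assumption for this sketch).
    args = ["salmon", "quant", "-i", str(index), "-l", "A"]

    if layout == "PAIRED":
        if len(reads) != 2:
            raise ValueError(f"{sample_id}: PAIRED layout needs exactly two FASTQ files")
        args += ["-1", reads[0], "-2", reads[1]]
    elif layout == "SINGLE":
        if not reads:
            raise ValueError(f"{sample_id}: SINGLE layout needs one FASTQ file")
        args += ["-r", reads[0]]  # any second file is ignored
    else:
        raise ValueError(f"{sample_id}: unknown library_layout {layout!r}")

    # Assumed layout: one output directory per sample.
    args += ["-o", str(PurePosixPath(outdir) / sample_id)]
    return args


if __name__ == "__main__":
    print(" ".join(salmon_quant_args(
        "sampleA", "PAIRED", "salmon_index",
        ["ERR1234567_1.fastq.gz", "ERR1234567_2.fastq.gz"],
    )))
    print(" ".join(salmon_quant_args(
        "sampleB", "SINGLE", "salmon_index",
        ["ERR1234568_1.fastq.gz", None],
    )))
```

Branching on the declared layout rather than on which files happen to exist means a paired-end sample with a missing mate fails loudly instead of being quantified as single-end.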

Reproducibility

Containerized execution
Explicit versioning of tools
Deterministic pipeline structure via Nextflow DSL2

Extensibility

This pipeline is intentionally modular and can be extended to include:

Salmon index generation
DESeq2 differential expression analysis
Metadata-driven contrasts
Automated QC thresholding
Pathway and gene ontology enrichment

Use Cases

Academic RNA-seq analysis
Construct or condition comparison
Method development and benchmarking
Portfolio demonstration of reproducible bioinformatics workflows

Author Notes

This project was developed to demonstrate production-quality bioinformatics pipeline design with an emphasis on real-world usability.

It is suitable both as a research tool and as a portfolio artifact for computational biology and bioinformatics roles.
