The PGRP_Snakemake pipeline provides an efficient, modular workflow for processing RNA sequencing data from multiple sources. It takes users from downloading raw data through differential expression analysis in a single distributed pipeline. Raw sequencing reads can be pulled directly from the NCBI SRA or read from local storage, and both paired- and single-end libraries are compatible.
Pipeline overview:
- Download fastq from SRA (SRA Toolkit)
- Quality control on raw reads (FastQC)
- Trimming (fastp)
- Quality control on trimmed reads (FastQC)
- Map reads to reference (STAR)
- Count reads (RSEM/HTseq/FeatureCounts/TPMcalculator)
- Normalize counts TPM/FPKM (custom scripts)
- Summary statistics of normalized counts (custom scripts)
- Differential expression analysis (Trinity/DESeq2)
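The TPM/FPKM normalization step above can be sketched as follows. This is a minimal illustration with made-up counts and gene lengths, not the pipeline's actual custom scripts:

```python
# Minimal sketch of TPM and FPKM normalization (illustrative values only,
# not the pipeline's custom scripts).

def fpkm(counts, lengths_bp):
    """Fragments Per Kilobase per Million mapped fragments."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then scale to 1e6."""
    rpk = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

counts = [100, 500, 400]       # raw read counts per gene
lengths = [1000, 2000, 4000]   # gene lengths in bp
print(tpm(counts, lengths))    # TPM values always sum to 1e6
print(fpkm(counts, lengths))
```

Unlike FPKM, TPM sums to one million in every sample, which makes values comparable across libraries.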
Clone the repository:
git clone git@github.com:plantgenomicslab/PGRP_Snakemake.git
Dependencies:
- Trim Galore 0.6.7
- SRA Toolkit 2.11.0
- STAR
- HTseq 1.99.2
- Subread 2.0.1
- multiqc 1.11
- snakemake 7.24.0
- parallel-fastq-dump 0.6.7
- Samtools 1.14
- ggplot2
- Trinity 2.13.2
- Graphviz
- Gffread
- TPMcalculator
- Bioconductor Qvalue
- RSEM
- tabulate 0.8.10
- fastp 0.23.2
We recommend setting up a conda environment for PGRP_Snakemake to easily install the above dependencies. The latest version of Miniconda can be downloaded from the Miniconda website. Use the commands below to set up the PGRP_Snakemake environment.
conda create -n PGRP_Snakemake -c bioconda -c conda-forge python=3.7 mamba
conda activate PGRP_Snakemake
mamba install -c bioconda -c conda-forge -c anaconda tabulate=0.8.10 trim-galore=0.6.7 sra-tools=2.11.0 STAR htseq=1.99.2 subread=2.0.1 multiqc=1.11 snakemake=7.24.0 parallel-fastq-dump=0.6.7 bioconductor-tximport samtools=1.14 r-ggplot2 trinity=2.13.2 hisat2 bioconductor-qvalue sambamba graphviz gffread tpmcalculator lxml rsem fastp=0.23.2
Add a config.json file to the repository root. This file controls various inputs to the workflow and must be updated by the user. A template for config.json is available in the example/ directory.
cd ../PGRP_Snakemake
cp example/example_config.json config.json
Configure SRA Toolkit (only necessary if using SRA). The following commands allow you to specify where large SRA files get stored and ensure that your connection doesn't time out when downloading data from NCBI's SRA database.
# Set up the root directory for SRA files
# Enter '4' in interactive editor. Then enter your path.
vdb-config -i --interactive-mode textual
# Add the new path to config.json under 'sra_dir'
#vim config.json
# Set timeout to 100 seconds
vdb-config -s /http/timeout/read=100000
The pipeline requires an indexed reference genome and a GTF file as input. To add a reference genome, download FASTA and GFF3 files from an appropriate source, then convert the GFF3 to GTF:
# Make sure file is unzipped
gffread [GFF_file] -T -F --keep-exon-attrs -o [genome].gtf
# Update config.json with the relative path to the GTF file and the reference folder
#vim config.json
# Index the reference genome with STAR (make sure genome fasta is unzipped)
# Helps to run on a compute cluster (computationally expensive)
STAR --runThreadN 48 \
--runMode genomeGenerate \
--genomeDir . \
--genomeFastaFiles [genome.fa] \
--sjdbGTFfile [genome.gtf] \
--sjdbOverhang 99 \
--genomeSAindexNbases 12
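The `--genomeSAindexNbases` value should be lowered for small genomes. The STAR manual recommends min(14, log2(GenomeLength)/2 - 1); a quick helper to compute it (genome sizes below are made-up examples):

```python
# Compute a suitable --genomeSAindexNbases value following the STAR
# manual's recommendation: min(14, log2(GenomeLength)/2 - 1).
import math

def sa_index_nbases(genome_length_bp):
    return min(14, int(math.log2(genome_length_bp) / 2 - 1))

print(sa_index_nbases(3_000_000_000))  # human-sized genome -> 14 (the default)
print(sa_index_nbases(12_000_000))     # small genome -> 10
```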
# If using RSEM, prepare the reference
rsem-prepare-reference -p 48 --gtf [genome.gtf] [genome.fa] [rsem_prep]
The pipeline requires the user to create a control file called RunsByExperiment.tsv. An example of this file is provided in example/.
RunsByExperiment.tsv provides the pipeline with information about the data you want to process, which can be either SRA runs or locally stored data. When downloading SRA data, there may be multiple SRA runs (SRR...) for each SRA experiment (SRX...), where an experiment represents the sequencing performed on a particular sample. Experiments should be given meaningful titles to aid in the interpretation of the pipeline output. Finally, the relationships between replicates and treatments should be fleshed out for count aggregation and DEG analysis.
For local analyses, the scripts/Experiment_name_composer.py script takes an input folder and an optional output file name as command-line arguments. It generates a list of sample names from files with specific extensions in the input folder, removes any _rep* or _Rep* suffixes from the sample names, and creates an output file with columns for Run, Replicate, and Sample. The Run and Replicate columns contain the original sample name, while the Sample column contains the cleaned name without the _rep* or _Rep* suffix.
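The sample-name cleaning described above amounts to stripping a trailing replicate suffix. A minimal sketch (the function name and the file prefixes are illustrative, not the actual script):

```python
# Sketch of the sample-name cleaning: strip a trailing _rep*/_Rep*
# suffix to derive the Sample column from each file prefix.
import re

def clean_sample_name(replicate):
    # Remove "_rep..." or "_Rep..." from the end of the name
    return re.sub(r'_[rR]ep.*$', '', replicate)

for prefix in ["ZT0_rep1", "ZT0_rep2", "ZT4_Rep1"]:
    row = {"Run": prefix, "Replicate": prefix,
           "Sample": clean_sample_name(prefix)}
    print(row)
```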
# Example format of RunsbyExperiment.tsv
Run Experiment Replicate Treatment
SRR5210841 SRX2524297 ZT0_rep1 ZT0
SRR5210842 SRX2524297 ZT0_rep1 ZT0
SRR5210843 SRX2524297 ZT0_rep1 ZT0
SRR5210844 SRX2524298 ZT4_rep1 ZT4
SRR5210845 SRX2524298 ZT4_rep1 ZT4
SRR5210846 SRX2524298 ZT4_rep1 ZT4
SRR5210847 SRX2524299 ZT8_rep1 ZT8
SRR5210848 SRX2524299 ZT8_rep1 ZT8
SRR5210849 SRX2524299 ZT8_rep1 ZT8
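The run-to-experiment relationship in this table drives count aggregation: all runs sharing an Experiment ID belong to the same sample. A sketch of how the table can be parsed and grouped (a few rows of the example inlined for illustration):

```python
# Parse RunsByExperiment.tsv-style data and group runs by experiment,
# mirroring the example table above (rows inlined for illustration).
import csv, io
from collections import defaultdict

tsv = """Run\tExperiment\tReplicate\tTreatment
SRR5210841\tSRX2524297\tZT0_rep1\tZT0
SRR5210842\tSRX2524297\tZT0_rep1\tZT0
SRR5210844\tSRX2524298\tZT4_rep1\tZT4
"""

runs_by_exp = defaultdict(list)
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    runs_by_exp[row["Experiment"]].append(row["Run"])

print(dict(runs_by_exp))
# {'SRX2524297': ['SRR5210841', 'SRR5210842'], 'SRX2524298': ['SRR5210844']}
```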
Because these files can be tedious to generate for projects with many samples, a script called joinSraRelations.py is provided. This script takes an SRA project ID (SRP...) as input and fetches the associated run information from the SRA API. The final two inputs are regular expressions that parse the human-readable treatment and replicate text out of the full SRA run titles.
# Example usage for SRA project SRP098160
# Full run titles for this project take the form: 'GSM2471308: ZT0_rep1; Glycine max; RNA-Seq'
# Experiment titles will be of the form 'ZT0_rep1'
./scripts/joinSraRelations.py SRP098160 "ZT\d{1,2}_rep\d" "ZT\d{1,2}"
# To skip the regex step and fill in the fields manually in a text editor
./scripts/joinSraRelations.py SRP098160 ".*" ".*"
If running locally, the 'Run' and 'Experiment' fields of RunsByExperiment.tsv may be omitted. The 'Replicate' field should contain the prefixes of all fastq files to be included in the analysis, and 'Treatment' should be the name of the treatment to which each replicate belongs. Make sure to update config.json with the naming convention for paired-end files.
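The regex matching that joinSraRelations.py performs can be illustrated in a few lines, using the example run title shown above (this is a sketch of the idea, not the script itself):

```python
# Extract replicate and treatment from a full SRA run title using the
# same regular expressions passed to joinSraRelations.py above.
import re

title = "GSM2471308: ZT0_rep1; Glycine max; RNA-Seq"
replicate = re.search(r"ZT\d{1,2}_rep\d", title).group(0)
treatment = re.search(r"ZT\d{1,2}", title).group(0)
print(replicate, treatment)  # ZT0_rep1 ZT0
```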
Another optional control file called replication_relationship.txt must be created if differential expression analysis is desired. If the file is missing then DE analysis will not be performed. replication_relationship.txt provides tab separated contrasts to be used in differential gene expression analysis using DESeq2. The first column defines treatments and the second column defines replicates associated with each treatment. Pairwise differentially expressed genes will be computed for each combination of treatments. See example/ for an example:
# Example format of replication_relationship.txt
cat replication_relationship.txt
ZT0 ZT0_rep1
ZT0 ZT0_rep2
ZT0 ZT0_rep3
ZT4 ZT4_rep1
ZT4 ZT4_rep2
ZT4 ZT4_rep3
ZT8 ZT8_rep1
ZT8 ZT8_rep2
ZT8 ZT8_rep3
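As described above, DE analysis runs on every pairwise combination of treatments in this file. A sketch of how those contrasts can be enumerated (treatment/replicate pairs inlined from the example):

```python
# Enumerate pairwise DESeq2 contrasts: every combination of two
# treatments from replication_relationship.txt (data inlined).
from itertools import combinations

pairs = [("ZT0", "ZT0_rep1"), ("ZT0", "ZT0_rep2"),
         ("ZT4", "ZT4_rep1"), ("ZT8", "ZT8_rep1")]
treatments = sorted({t for t, _ in pairs})
contrasts = list(combinations(treatments, 2))
print(contrasts)  # [('ZT0', 'ZT4'), ('ZT0', 'ZT8'), ('ZT4', 'ZT8')]
```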
# Check the pipeline prior to run
snakemake --snakefile Snakefile -np
# Visualize the pipeline as a DAG
snakemake --snakefile Snakefile \
--dag [output] | \
dot -Tpdf -Gnodesep=0.75 -Granksep=0.75 > dag.pdf
# Run the pipeline
snakemake --snakefile Snakefile \
--cores [available cores]
# Submit the pipeline to a SLURM cluster via the run.sh wrapper
sbatch --mem=4g \
-c 2 \
--time=13-11:00:00 \
-o snakemake.out \
-e snakemake.err \
--wrap="./run.sh"
# Check the status of submitted jobs
sacct -j 5396911,5396912 --format=JobID,State,ExitCode,Elapsed,NodeList,MaxRSS,MaxVMSize%20
scontrol show job -dd 5396911
scontrol show job -dd 5396912
# Inspect per-sample logs produced by the pipeline
tail -n +200 output/AgteqT02_rep2/AgteqT02_rep2/logs/AgteqT02_rep2_raw_fastqc.log
tail -n +200 output/AgteqT12_rep1/AgteqT12_rep1/logs/AgteqT12_rep1_trim.log
References:
- Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. https://doi.org/10.14806/ej.17.1.200
- Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras, STAR: ultrafast universal RNA-seq aligner, Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635
- Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
- Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014
- Vera Alvarez R, Pongor LS, Mariño-Ramírez L, Landsman D. TPMCalculator: one-step software to quantify mRNA abundance of genomic features. Bioinformatics. 2019 Jun 1;35(11):1960-1962. doi: 10.1093/bioinformatics/bty896. PMID: 30379987; PMCID: PMC6546121.
- Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics (Oxford, England), 31(2), 166–169. https://doi.org/10.1093/bioinformatics/btu638
- Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8