Skip to content

plantgenomicslab/PGRP_Snakemake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

153 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PGRP_Snakemake

Introduction

The PGRP_Snakemake pipeline provides an efficient and modular workflow for processing RNA sequencing data from multiple sources. The workflow provides users with a complete and distributed pipeline from downloading raw data to differential expression analysis. The pipeline can utilize raw sequencing reads directly from NCBI SRA or from local storage and is compatable with paired- or single-end libraries.

Pipeline overview:

  • Download fastq from SRA (SRA Toolkit)
  • Quality control on raw reads (FastQC)
  • Trimming (fastp)
  • Quality control on trimmed reads (FastQC)
  • Map reads to reference (STAR)
  • Count reads (RSEM/HTseq/FeatureCounts/TPMcalculator)
  • Normalize counts TPM/FPKM (custom scripts)
  • Summary statistics of normalized counts (custom scripts)
  • Differential expression analysis (Trinity/DESeq2)

Installation

Clone the repository:

git clone git@github.com:plantgenomicslab/PGRP_Snakemake.git

Dependencies

  • Trim Galore 0.6.7
  • SRA Toolkit 2.11.0
  • STAR
  • HTseq 1.99.2
  • Subread 2.0.1
  • multiqc 1.11
  • snakemake 7.24.0
  • parallel-fastq-dump 0.6.7
  • Samtools 1.14
  • ggplot2
  • Trinity 2.13.2
  • Graphviz
  • Gffread
  • TPMcalculator
  • Bioconductor Qvalue
  • RSEM
  • tabulate 0.8.10
  • fastp 0.23.2

Setting up a Conda environment

We recommend setting up a conda environment for PGRP_Snakemake to easily install the above dependencies. The latest version of Miniconda can be installed here. Use the commands below to set up the PGRP_Snakemake environment.

conda create -n PGRP_Snakemake -c bioconda -c conda-forge python=3.7 mamba

conda activate PGRP_Snakemake

mamba install -c bioconda -c conda-forge -c anaconda tabulate=0.8.10 trim-galore=0.6.7 sra-tools=2.11.0 STAR htseq=1.99.2 subread=2.0.1 multiqc=1.11 snakemake=7.24.0 parallel-fastq-dump=0.6.7 bioconductor-tximport samtools=1.14 r-ggplot2 trinity=2.13.2 hisat2 bioconductor-qvalue sambamba graphviz gffread tpmcalculator lxml rsem fastp=0.23.2

Configuration

Add config.json to the repo. This file controls various inputs to the workflow and must be updated by the user. A template for config.json is availble in the example/ directory.

cd ../PGRP_Snakemake
cp example/example_config.json config.json

Configure SRA Toolkit (only necessary if using SRA). The following commands allow you to specify where large SRA files get stored and ensure that your connection doesn't time out when downloading data from NCBI's SRA database.

# Set up the root dierctory for SRA files
# Enter '4' in interactive editor. Then enter your path.
vdb-config  -i --interactive-mode textual

# Add the new path to config.json under 'sra_dir'
#vim config.json

# Set timeout to 100 seconds
vdb-config -s /http/timeout/read=100000

Input Files

Reference genome

The pipeline requires an indexed reference genome and GTF file as input. To add a reference genome to the pipeline download fasta and GFF3 files from an appropriate source. Then:

# Make sure file is unzipped
gffread [GFF_file] -T -F --keep-exon-attrs -o [genome].gtf

# Update config.json with the relative path to the GTF file and the reference folder
#vim config.json

# Index the reference genome with STAR (make sure genome fasta is unzipped)
# Helps to run on a compute cluster (computationally expensive)
STAR  --runThreadN 48 \
  --runMode genomeGenerate \
  --genomeDir . \
  --genomeFastaFiles [genome.fa] \
  --sjdbGTFfile [genome.gtf] \
  --sjdbOverhang 99 \
  --genomeSAindexNbases 12

# If using RSEM, prepare the reference 
rsem-prepare-reference -p 48 --gtf [genome.gtf] [genome.fa] [rsem_prep]

Workflow control file

The pipeline requires the user to create a control file called RunsByExperiment.tsv. An examples of this file is provided in examples/.

RunsByExperiment.tsv provides the pipleline with information about the data that you want to process. These can be either SRA runs or locally stored data. When downloading SRA data, there may be multiple SRA runs (SRR...) for each SRA experiment (SRX...) where experiments represents the sequencing performed on a particular sample. Experiments should be given meainingful titles to aid in the interpretation of the pipeline output. Finally, the relationships between replicates and treatments should be flushed out for count aggregation and DEG analysis.

For local analysis scripts/Experiment_name_composer.py script takes an input folder and an optional output file name as command-line arguments. It generates a list of sample names from files with specific extensions in the input folder, removes any _rep* or _Rep* suffixes from the sample names, and creates an output file with columns for Run, Replicate, and Sample. The Run and Replicate columns contain the original sample name, while the Sample column contains the cleaned sample name without the _rep* or _Rep* suffix.

# Example format of RunsbyExperiment.tsv

Run	Experiment	Replicate	Treatment
SRR5210841	SRX2524297	ZT0_rep1	ZT0
SRR5210842	SRX2524297	ZT0_rep1	ZT0
SRR5210843	SRX2524297	ZT0_rep1	ZT0
SRR5210844	SRX2524298	ZT4_rep1	ZT4
SRR5210845	SRX2524298	ZT4_rep1	ZT4
SRR5210846	SRX2524298	ZT4_rep1	ZT4
SRR5210847	SRX2524299	ZT8_rep1	ZT8
SRR5210848	SRX2524299	ZT8_rep1	ZT8
SRR5210849	SRX2524299	ZT8_rep1	ZT8

Because these files can tedious to generate for projects with many samples, a script called joinSraRelations.py is provided. This script takes an SRA project id (SRP...) as input and fetches the associated run information over the SRA API. The final two inputs are regex expressions to parse out the human readable treatment/replicate text from the full SRA run titles.

# Example usage for SRA project SRP098160
# Full run titles for this project take the form: 'GSM2471308: ZT0_rep1; Glycine max; RNA-Seq'
# Experiment titles will be of the form 'ZT0_rep1'
./scripts/joinSraRelations.py SRP098160 "ZT\d{1,2}_rep\d" "ZT\d{1,2}"

# To not deal with regex and fix manually in text editor
./scripts/joinSraRelations.py SRP098160 ".*" ".*"

If running locally, the 'Run' and 'Experiment' fields of RunsbyExperiment.tsv may be omitted. The 'Replicate' field should represent the prefixes of all fastq files to be included in the analysis. 'Treatment' then should be the desired name of the treatment to which each replicate belongs. Make sure to update config.json with the nomenclature for paired-ends.

DE control files

Another optional control file called replication_relationship.txt must be created if differential expression analysis is desired. If the file is missing then DE analysis will not be performed. replication_relationship.txt provides tab separated contrasts to be used in differential gene expression analysis using DESeq2. The first column defines treatments and the second column defines replicates associated with each treatment. Pairwise differentially expressed genes will be computed for each combination of treatments. See example/ for an example:

# Example format of replication_relationship.txt
cat deg_samples.txt

ZT0	ZT0_rep1
ZT0	ZT0_rep2
ZT0	ZT0_rep3
ZT4	ZT4_rep1
ZT4	ZT4_rep2
ZT4	ZT4_rep3
ZT8	ZT8_rep1
ZT8	ZT8_rep2
ZT8	ZT8_rep3

Running the pipeline

Running without scheduler

# Check the pipeline prior to run
snakemake --snakefile Snakefile -np

# Visualize the pipeline as a DAG
snakemake --snakefile Snakefile \
          --dag [output] | \
          dot -Tpdf -Gnodesep=0.75 -Granksep=0.75 > dag.pdf

# Run the pipeline 
snakemake --snakefile Snakefile \
          --cores [available cores]

Running with SLURM scheduler

sbatch --mem=4g \
       -c 2 \
       --time=13-11:00:00 \
       -o snakemake.out \
       -e snakemake.err \
       --wrap="./run.sh"

error checking

Slurm accounting for both jobs

sacct -j 5396911,5396912 --format=JobID,State,ExitCode,Elapsed,NodeList,MaxRSS,MaxVMSize%20

Full Slurm job records (helpful if they were OOM-killed or preempted)

scontrol show job -dd 5396911
scontrol show job -dd 5396912

Tool logs from your rules

tail -n +200 output/AgteqT02_rep2/AgteqT02_rep2/logs/AgteqT02_rep2_raw_fastqc.log
tail -n +200 output/AgteqT12_rep1/AgteqT12_rep1/logs/AgteqT12_rep1_trim.log

Nextflow

nextflow run main.nf -params-file params.yaml -c resources-slurm.config -resume -with-report report.html -with-trace trace.txt -with-timeline timeline.html -with-dag flowchart.png

Citations

  • Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. doi:https://doi.org/10.14806/ej.17.1.200
  • Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras, STAR: ultrafast universal RNA-seq aligner, Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21, https://doi.org/10.1093/bioinformatics/bts635
  • Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
  • Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014
  • Vera Alvarez R, Pongor LS, Mariño-Ramírez L, Landsman D. TPMCalculator: one-step software to quantify mRNA abundance of genomic features. Bioinformatics. 2019 Jun 1;35(11):1960-1962. doi: 10.1093/bioinformatics/bty896. PMID: 30379987; PMCID: PMC6546121.
  • Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics (Oxford, England), 31(2), 166–169. https://doi.org/10.1093/bioinformatics/btu638
  • Love, M.I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8

About

PGRP_Snakemake

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors