Pipeline will intake raw counts file and run differential gene expression analysis using three different tools (Deseq2, EdgeR, and Limma_Voom) and perform Gene Set Enrichment Analysis using Deseq2 output.
Deseq2: From Deseq2, users can expect a comprehensive analysis of differential gene expression using DESeq2. The script takes as input raw gene counts, sample metadata, and contrasts between experimental conditions. It normalizes the gene counts, performs differential expression analysis, and outputs normalized counts, lists of differentially expressed genes (DEGs), and corresponding MA plots. The DEGs are filtered based on a log fold change threshold of 2 for upregulation and 0.5 for downregulation, with a significance threshold of p-value < 0.05. Additionally, heatmaps for the top 30 differentially expressed genes are generated to visually represent expression changes across the relevant samples. The code ensures that all outputs, including tables and plots, are saved to a specified directory for easy retrieval and downstream analysis.
EdgeR: The EdgeR code performs differential gene expression analysis by normalizing raw RNA-seq counts using TMM, accounting for library size differences, and estimating gene-specific dispersions. Users can expect the code to detect significantly differentially expressed genes between experimental groups using an exact test. It filters results based on an FDR threshold of 0.05 and log fold change thresholds of ≥ 2 or ≤ 0.5 (corresponding to a log2 fold change of ±1). The code generates visualizations such as MDS plots, MA plots, and heatmaps to help users explore gene expression patterns across samples.
Limma Voom: The Limma-voom code applies the voom transformation to RNA-seq data, stabilizing variance across gene expression levels and making the data suitable for linear modeling. Users can expect the code to fit linear models to the transformed data and perform empirical Bayes moderation to improve statistical accuracy. Contrasts between experimental groups are analyzed for differential expression, with results filtered based on an adjusted p-value threshold of 0.05 and a log fold change threshold of log2 fold change of ±1. The code generates MA plots and heatmaps to visualize the top differentially expressed genes, providing a detailed summary of expression changes.
Venn Diagram: This Venn diagram script takes filtered gene lists as input from multiple differential expression analyses, specifically from DESeq2, edgeR, and limma-voom. These gene lists are used to generate a Venn diagram, which visually represents the overlap of differentially expressed genes identified by each method. The script processes the input files containing filtered genes, ensures that the required GeneID column is present, and creates the diagram. The resulting Venn diagram is saved as a PNG file, providing a clear visualization of shared and unique genes between the analyses.
Gene Set Enrichment Anaysis: Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a pre-defined set of genes (ex: those beloging to a specific GO term or KEGG pathway) shows statistically significant, concordant differences between two biological states.
Clone the Repository: Clone the new repository to your local machine, choosing the directory where you want to perform data analysis. Instructions for cloning can be found here.
-Example of input RawCountFile_rsemgenes.txt
gene_id | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 |
---|---|---|---|---|---|---|
ENSG00000000003.14_TSPAN6 | 1500 | 1600 | 1450 | 1300 | 1250 | 1400 |
ENSG00000000005.6_TNMD | 10 | 12 | 9 | 15 | 14 | 18 |
ENSG00000000419.12_DPM1 | 3000 | 3100 | 2900 | 3200 | 3300 | 3400 |
ENSG00000000457.14_SCYL3 | 500 | 520 | 480 | 600 | 590 | 610 |
... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... |
ENSG00000005955.13_ACTN2 | 500 | 530 | 460 | 550 | 540 | 570 |
ENSG00000006062.15_RPS6 | 600 | 650 | 580 | 700 | 680 | 720 |
Add sample condition contrast pairs. Example:
parent | control |
---|
Add sample details keeping the same format such as:
samplename | condition | batch |
---|---|---|
Sample1 | parent | b1 |
Sample2 | parent | b2 |
Sample3 | parent | b2 |
Sample4 | control | b1 |
Sample5 | control | b2 |
Sample6 | control | b2 |
Give full path to raw_counts data
Give full path to results directory
Give full path to contrasts file
Select the reference genome of choice. For example: “mouse”
Select the organism_db accordingly. For example: "org.Mm.eg.db" for mouse
Select the kegg_organism. For example: "mmu" for mouse
Give full path to configfile
Give full path to Snakemake rule files
Give full path to configfile
deseq2.smk
pathway.smk
edger.smk
venn.smk
rmarkdown.smk
limma_voom.smk
module load snakemake/8.4.8
$NAME=my_environment_name
conda create -n $NAME
Activate the Conda Environment:
conda activate $NAME
Install mamba
conda install -c conda-forge mamba
Perform a dry-run to validate your setup:
snakemake --use-conda -np
Local Execution: Execute the workflow on your local machine using $N cores:
snakemake --use-conda --cores $N --latency-wait 55 --force
Here, $N represents the number of cores you wish to allocate for the workflow.
After successful execution, output analysis files along with rmarkdown html report will be generated in results folder.
Move the results folder to a different path and the workflow will be ready to run another dataset.
conda deactivate