This is a pipeline to analyze next-generation sequencing of small RNAs in C. elegans. The pipeline can be broken down into two major parts:
1. **Generate count matrices.** Trims reads and maps them to the C. elegans genome, then generates count matrices of the number of reads mapping antisense to each gene. This first part of the pipeline is designed to run on a Linux high-performance computing cluster that uses the Slurm workload manager.
   - `generate_count_matrices.sh` is the main file for Part 1 and calls all the other Part 1 scripts. On line 6 of `generate_count_matrices.sh`, specify the full pathname of your project directory:

         # Assign a variable to the pathname of the project (this is the main directory). Change this to fit your own path.
         main_dir=<project pathname>

   - Then execute lines 8-16 of `generate_count_matrices.sh` to generate the following directory structure:

         project name
         ├── logs
         ├── meta
         ├── raw_data
         ├── results
         └── scripts

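
The directory-creation step can be sketched as below. This is a minimal illustration, not a copy of lines 8-16 of `generate_count_matrices.sh`; the temporary location used for `main_dir` is a placeholder chosen so the example runs anywhere.

```shell
# Placeholder project directory for illustration only. In the pipeline,
# main_dir is the full pathname you set on line 6.
main_dir="$(mktemp -d)/my_project"

# Create the five top-level directories of the project layout.
for sub in logs meta raw_data results scripts; do
    mkdir -p "${main_dir}/${sub}"
done

# List the result to confirm the layout.
ls "${main_dir}"
```

Using `mkdir -p` makes the step safe to re-run, since existing directories are left untouched.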
   - Before continuing on with the rest of `generate_count_matrices.sh`, make sure that:
     - You've copied your raw, demultiplexed fastq files into the `raw_data` directory.
     - All Part 1 scripts are in the `scripts` directory.
     - Your metadata file `metadata.txt` is in the `meta` directory. Column 1 of `metadata.txt` must contain the desired output filename, and there must also be a column containing the input filename. See `metadata.txt` in this repository for an example.
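
A quick pre-flight check of these requirements might look like the sketch below. The function `check_inputs` is hypothetical and not part of the pipeline. It assumes that `metadata.txt` is tab-delimited with one header row, and it takes the position of the input-filename column as a parameter because that position is not fixed by the description above.

```shell
# check_inputs MAIN_DIR INPUT_COL   (hypothetical helper)
#   MAIN_DIR  : project directory containing meta/ and raw_data/
#   INPUT_COL : 1-based metadata column holding the input filename
# Assumes metadata.txt is tab-delimited with a single header row.
check_inputs() {
    ci_main_dir="$1"
    ci_input_col="$2"
    ci_meta="${ci_main_dir}/meta/metadata.txt"

    if [ ! -f "${ci_meta}" ]; then
        echo "missing: ${ci_meta}" >&2
        return 1
    fi

    # Skip the header row, pull out the input-filename column, and
    # collect every listed file that is absent from raw_data/.
    ci_missing="$(tail -n +2 "${ci_meta}" | cut -f "${ci_input_col}" |
        while IFS= read -r ci_name; do
            if [ -n "${ci_name}" ] && [ ! -f "${ci_main_dir}/raw_data/${ci_name}" ]; then
                echo "${ci_name}"
            fi
        done)"

    if [ -n "${ci_missing}" ]; then
        echo "missing from raw_data: ${ci_missing}" >&2
        return 1
    fi
    return 0
}
```

For example, `check_inputs "$main_dir" 2 && echo "inputs look complete"` would report success only when every file named in column 2 is present in `raw_data`.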
   - Note, this pipeline assumes the reads contain a 4-nucleotide-long barcode at the 5' end. If your reads do not contain a 5' barcode and instead begin immediately with the insert, make the following two changes:
     - Change line 36 in `select_5prime_barcode.sh` from:

           grep -B 1 -A 2 -e ^$barcode1 -e ^$barcode2 $input_path$input_file | sed '/^--/d' > $output_path$new_name

       to:

           cp $input_path$input_file $output_path$new_name

       With this change, running `select_5prime_barcode.sh` will simply assign new, meaningful names to the fastq files using `metadata.txt` and place them in a new directory in `results` called `sort_5prime`.
     - Change line 23 in `trim_5prime.sh` from:

           cutadapt -u 4 -o $output $1 > ${2}/logs/trim5/${base}.txt

       to:

           cp $1 $output

       With this change, running `trim_5prime.sh` will simply copy the fastq files into a new directory in `results` called `trim3_trim5` and add "_trim5" to the end of each filename.
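
To see what the original barcode-selection command does before deciding whether to replace it, the sketch below runs the same `grep`/`sed` filter on a toy FASTQ file. The read names, sequences, and the barcode values `AAAA` and `TTTT` are made up for illustration; they are not taken from the pipeline.

```shell
# Toy FASTQ with three reads. Only read1 (AAAA...) and read3 (TTTT...)
# start with one of the two example barcodes.
demo_dir="$(mktemp -d)"
cat > "${demo_dir}/toy.fastq" <<'EOF'
@read1
AAAAGGGGCCCC
+
IIIIIIIIIIII
@read2
CCCCGGGGAAAA
+
IIIIIIIIIIII
@read3
TTTTGGGGCCCC
+
IIIIIIIIIIII
EOF

barcode1="AAAA"   # hypothetical barcode values
barcode2="TTTT"

# Same filter as the original line 36: keep each sequence line that starts
# with a barcode, plus its header (-B 1) and its "+" and quality lines (-A 2),
# then drop the "--" separators that grep prints between non-adjacent groups.
grep -B 1 -A 2 -e "^${barcode1}" -e "^${barcode2}" "${demo_dir}/toy.fastq" \
    | sed '/^--/d' > "${demo_dir}/selected.fastq"

cat "${demo_dir}/selected.fastq"
```

Reads without a matching barcode (here `read2`) are dropped, which is why this filter has to be replaced with a plain `cp` when reads begin directly with the insert. Likewise, `cutadapt -u 4` removes the first four bases of every read, so leaving it in place would clip real insert sequence from barcode-free reads.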
2. **Differential analysis and visualization.** Uses the count matrices generated in Part 1 to perform a simple wild type vs. mutant analysis to identify genes that are differentially targeted by small RNAs. This part of the pipeline is designed to run as an RStudio project (`DA_and_visualization.Rproj`).
   - `main_script.R` is the main file for Part 2.
   - Before beginning, make sure the count matrices are in the `data` directory and that the metadata file (.csv format) is in the `meta` directory. Examples of these files can be found in the `example_files` directory.
   - Part 2 outputs include the following:
     - A table of normalized counts (median of ratios method)
     - A list of differentially targeted genes and their corresponding log2 fold changes and adjusted p-values
     - A biplot of the top two principal components determined by principal component analysis
     - A volcano plot of log2 fold change vs. significance, with labels for the top 10 significant genes
     - The option to plot normalized counts for any given gene (specified by WormBase Gene ID)
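
For reference, the median of ratios normalization named in the first output can be written as follows. This is the standard DESeq2 formulation, given here as background rather than as a description of anything specific to `main_script.R`. The symbols are notation introduced for this note: $k_{ij}$ is the raw count for gene $i$ in sample $j$, $m$ is the number of samples, $g_i$ is the geometric mean of gene $i$ across samples, $s_j$ is the size factor of sample $j$, and $\tilde{k}_{ij}$ is the normalized count.

```latex
\[
g_i = \Bigl(\prod_{v=1}^{m} k_{iv}\Bigr)^{1/m},
\qquad
s_j = \operatorname*{median}_{i \,:\, g_i > 0} \frac{k_{ij}}{g_i},
\qquad
\tilde{k}_{ij} = \frac{k_{ij}}{s_j}
\]
```

Genes with a zero count in any sample have $g_i = 0$ and are therefore excluded when the median is taken.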

| Software | Version | Used in |
|---|---|---|
| `gcc` | 6.2.0 | Part 1: Generate count matrices |
| `python` | 2.7.12 | Part 1: Generate count matrices |
| `cutadapt` | 1.14 | Part 1: Generate count matrices |
| `fastqc` | 0.11.5 | Part 1: Generate count matrices |
| `bowtie` | 1.2.2 | Part 1: Generate count matrices |
| `samtools` | 1.9 | Part 1: Generate count matrices |
| `deeptools` | 3.0.2 | Part 1: Generate count matrices |
| `featureCounts` | 2.0.0 | Part 1: Generate count matrices |
| `R` | 3.5.1 | Part 2: Differential analysis and visualization |
| `DESeq2` | 1.22.2 | Part 2: Differential analysis and visualization |
| `tidyverse` | 1.2.1 | Part 2: Differential analysis and visualization |
| `ggrepel` | 0.8.1 | Part 2: Differential analysis and visualization |
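
One way to compare an environment against the versions listed above is sketched below. The helper `report_version` is hypothetical and not part of the pipeline. It assumes each tool is already on `PATH` (on a cluster this may first require loading the relevant modules) and prints a notice instead of failing when a tool is absent.

```shell
# report_version TOOL FLAG   (hypothetical helper)
# Prints the first non-empty line that "TOOL FLAG" writes, or a notice if
# TOOL is not on PATH. stderr is merged in because some tools print their
# version there.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$1" "$("$1" "$2" 2>&1 | sed -n '/./{p;q;}')"
    else
        printf '%s: not found on PATH\n' "$1"
    fi
}

report_version gcc --version
report_version python --version
report_version cutadapt --version
report_version fastqc --version
report_version bowtie --version
report_version samtools --version
report_version deeptools --version
report_version featureCounts -v
report_version R --version
```

The R packages are not command-line tools, so their versions are checked from within R instead, for example with `packageVersion("DESeq2")`.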