C. elegans small RNA-seq analysis

This pipeline analyzes next-generation sequencing data from small RNAs in C. elegans. It consists of two major parts:

Part 1: Generate count matrices. Trims and maps reads to the C. elegans genome, then generates count matrices of the number of reads mapping antisense to each gene. This part of the pipeline is designed to run on a Linux-based high-performance computing cluster that uses the Slurm scheduler.
- generate_count_matrices.sh is the main file for Part 1 and calls all the other Part 1 scripts. On line 6 of generate_count_matrices.sh, specify the full pathname of your project directory:
```
 # Assign a variable to the pathname of the project (this is the main directory). Change this to fit your own path.
 main_dir=<project pathname>
```
- Then execute lines 8-16 of generate_count_matrices.sh to generate the following directory structure:
```
 project name
   ├── logs
   ├── meta
   ├── raw_data
   ├── results
   └── scripts
```
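As a minimal sketch of what this setup step amounts to (not a copy of lines 8-16, which may differ), the five directories can be created as follows. The default path `my_project` is a hypothetical placeholder.

```shell
# Hypothetical default; in the pipeline, main_dir is set on line 6 of
# generate_count_matrices.sh.
main_dir="${main_dir:-$PWD/my_project}"

# Create each top-level directory; -p makes the loop safe to re-run.
for sub in logs meta raw_data results scripts; do
    mkdir -p "$main_dir/$sub"
done
```

Because of `mkdir -p`, running this again leaves existing directories and their contents untouched.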
- Before continuing with the rest of generate_count_matrices.sh, make sure that:
  - You've copied your raw, demultiplexed fastq files into the raw_data directory.
  - All Part 1 scripts are in the scripts directory.
  - Your metadata file metadata.txt is in the meta directory. Column 1 of metadata.txt must contain the desired output filename, and there must also be a column containing the input filename. See metadata.txt in this repository for an example.
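The checklist above can be partly automated. The helper below is a sketch under stated assumptions, not part of the pipeline: it assumes metadata.txt is tab-delimited, has one header row, and lists input filenames in column 2 unless another column is given. The function name `check_inputs` is hypothetical.

```shell
# check_inputs METADATA RAW_DIR [COLUMN]
# Prints every input filename listed in METADATA that is absent from RAW_DIR
# and returns 1 if any is missing.
# Assumptions (not taken from the pipeline): tab-delimited file, one header
# row, input filenames in column 2 unless COLUMN is given.
check_inputs() {
    meta=$1
    raw=$2
    col=${3:-2}
    rc=0
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip empty lines
        if [ ! -f "$raw/$name" ]; then
            echo "missing: $name"
            rc=1
        fi
    done <<EOF
$(tail -n +2 "$meta" | cut -f "$col")
EOF
    return "$rc"
}
```

A call such as `check_inputs "$main_dir/meta/metadata.txt" "$main_dir/raw_data"` then reports any fastq file that still needs to be copied into raw_data.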
- Note: this pipeline assumes that each read begins with a 4-nucleotide barcode at the 5' end. If your reads have no 5' barcode and instead begin immediately with the insert, make the following two changes:
  - Change line 36 in select_5prime_barcode.sh from:
```
 grep -B 1 -A 2 -e ^$barcode1 -e ^$barcode2 $input_path$input_file | sed '/^--/d' > $output_path$new_name
```
    to:
```
 cp $input_path$input_file $output_path$new_name
```
    With this change, running select_5prime_barcode.sh will simply assign new, meaningful names to the fastq files using metadata.txt and place them in a new directory in results called sort_5prime.
  - Change line 23 in trim_5prime.sh from:
```
 cutadapt -u 4 -o $output $1 > ${2}/logs/trim5/${base}.txt
```
    to:
```
 cp $1 $output
```
    With this change, running trim_5prime.sh will simply copy the fastq files into a new directory in results called trim3_trim5 and add "_trim5" to the end of each filename.
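To see what the two original commands do to barcoded reads, here is a small illustration on a toy FASTQ file. The reads and barcodes are made up, and the `awk` line is only a stand-in that mimics removing the first 4 bases (the effect of `cutadapt -u 4`); it is not what the pipeline runs.

```shell
# Toy FASTQ with two reads; sequences and barcodes are invented for this demo.
demo=$(mktemp)
printf '%s\n' \
    '@read1' 'AGCGTTTTCCCC' '+' 'IIIIIIIIIIII' \
    '@read2' 'TTTTGGGGAAAA' '+' 'IIIIIIIIIIII' > "$demo"

barcode1=AGCG   # hypothetical barcodes
barcode2=CGTC

# Same pattern as line 36 of select_5prime_barcode.sh: keep the header line
# before, and the two lines after, each line that starts with a barcode.
# sed removes the "--" separators grep prints between non-adjacent matches.
selected=$(grep -B 1 -A 2 -e "^$barcode1" -e "^$barcode2" "$demo" | sed '/^--/d')
echo "$selected"

# Stand-in for the 5' trim: drop the first 4 characters of the sequence and
# quality lines (the even-numbered lines of 4-line FASTQ records).
trimmed=$(echo "$selected" | awk 'NR % 2 == 0 { $0 = substr($0, 5) } { print }')
echo "$trimmed"

rm -f "$demo"
```

Only read1 starts with one of the barcodes, so it is the only record kept, and its sequence and quality strings are each 4 characters shorter after the trim.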
Part 2: Differential analysis and visualization. Uses the count matrices generated in Part 1 to perform a simple wild-type vs. mutant comparison that identifies genes differentially targeted by small RNAs. This part of the pipeline is designed to run as an RStudio project (DA_and_visualization.Rproj).
- main_script.R is the main file for Part 2.
- Before beginning, make sure the count matrices are in the data directory and that the metadata file (.csv format) is in the meta directory. Examples of these files can be found in the example_files directory.
- Part 2 outputs include the following:
  - A table of normalized counts (median of ratios method)
  - A list of differentially targeted genes and their corresponding log2 fold changes and adjusted p-values
  - A biplot of the top two principal components determined by principal component analysis
  - A volcano plot of log2 fold change vs. significance, with labels for the top 10 significant genes
  - The option to plot normalized counts for any given gene (specified by WormBase Gene ID)
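For reference, the median of ratios normalization named above is the size-factor method used by DESeq2. The symbols here are chosen for illustration and are not taken from main_script.R: $k_{ij}$ is the raw count for gene $i$ in sample $j$, and $m$ is the number of samples.

```latex
g_i = \Bigl( \prod_{v=1}^{m} k_{iv} \Bigr)^{1/m},
\qquad
\hat{s}_j = \operatorname*{median}_{i \,:\, g_i > 0} \frac{k_{ij}}{g_i},
\qquad
\tilde{k}_{ij} = \frac{k_{ij}}{\hat{s}_j}
```

Here $g_i$ is the geometric mean of gene $i$ across samples, $\hat{s}_j$ is the size factor of sample $j$, and $\tilde{k}_{ij}$ is the normalized count. Genes with a zero count in any sample have $g_i = 0$ and are left out of the median.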

Software requirements

| Software | Version | Used in |
| --- | --- | --- |
| `gcc` | 6.2.0 | Part 1: Generate count matrices |
| `python` | 2.7.12 | Part 1: Generate count matrices |
| `cutadapt` | 1.14 | Part 1: Generate count matrices |
| `fastqc` | 0.11.5 | Part 1: Generate count matrices |
| `bowtie` | 1.2.2 | Part 1: Generate count matrices |
| `samtools` | 1.9 | Part 1: Generate count matrices |
| `deeptools` | 3.0.2 | Part 1: Generate count matrices |
| `featureCounts` | 2.0.0 | Part 1: Generate count matrices |
| `R` | 3.5.1 | Part 2: Differential analysis and visualization |
| `DESeq2` | 1.22.2 | Part 2: Differential analysis and visualization |
| `tidyverse` | 1.2.1 | Part 2: Differential analysis and visualization |
| `ggrepel` | 0.8.1 | Part 2: Differential analysis and visualization |
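Before starting Part 1, it can help to confirm that the command-line tools listed above are on your PATH. The helper below is a hypothetical convenience, not part of the pipeline. It only checks that each command can be found, not which version is installed, and it assumes each tool is invoked under the name given (extend or adjust the list for your cluster).

```shell
# check_tools NAME...
# Prints each command that cannot be found on PATH and returns 1 if any is
# missing. Versions are not checked.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example call with some of the Part 1 tools.
check_tools gcc python cutadapt fastqc bowtie samtools featureCounts \
    || echo "Install or load the missing tools before running Part 1."
```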


annedodson/smallRNA-seq
