differential-expression-analysis/index.Rmd

---
title: 'Differential Expression Analysis with limma-voom'
output:
  html_document: default
---

# I. Preliminaries
This tutorial consists of a workflow demostration for differential expression analyses using RNA-seq data using limma voom. While there are now many published methods for tackling specific steps, as well as full-blown pipelines, limma voom has  been show to be one of the top performers with respect to controlling the false discovery rate. In our daylong differential expression workshop, we also cover DE testing with sleuth, another top performer with respect to FDR control, and which is designed explicitly to handle psueduo-alignment based estimates of abundance derived from kallisto. Our limma voom workflow will use a gene-level count matrix derived from estimates of RNA abundance in each samples. These estimates have been generated with RSEM, using bowtie2 read alignments. Our limma voom workflow will use a gene-level count matrix derived from estimates of RNA abundance in each samples. These estimates have been generated with RSEM, using bowtie2 read alignments. These alignments were to transcripts derived from an annotated reference genome.

Specific topics covered today include:

* Importing and pre-processing a count matrix
* Using a sample table to define factors used in DE analysis
* Analysis of single-factor designs
* Analysis of 2-factor designs


## Sample data
Our sample data comprises 12 paired-end RNA-seq libraries for whole body samples of *Drosophila melanogaster* from two geographic regions (Panama and Maine), with two temperature treatments ("low" ane "high") for each region, featuring three biological replicates for each region x treatment combination. Previously, these data were used to look for parallel gene expression patterns between high and low latitude populations (Zhao et al, 2015, *PLoS Genetics*)

## Loading required R libraries
First, load all the R libraries that will be used for today's analyses:
```{r, echo=TRUE}
library(edgeR)
library(limma)
```

## Data management
1. Load and view the table that associates sample IDs and treatments (dme_elev_samples.tab):
```{r, echo=TRUE}
s2c<-read.table("data/dme_elev_samples.tab",header = TRUE, stringsAsFactors=FALSE)
s2c
```

2. Open RSEM matrix
```{r,echo=TRUE}
rsem_gene_data<-read.table("data/dme_elevgrad_rsem_bt2_gene_counts.matrix.bz2",header=TRUE,row.names=1)
```
## Pre-processing and filtering
### Handling non-integer RSEM estimates
3. Round the expression matrix
```{r,echo=TRUE}
rnaseqMatrix=round(rsem_gene_data)
```
### Filtering out lowly expressed genes
4. Create a boolean variable that classifies samples according to whether CPM>=1:
```{r,echo=TRUE}
filter=rowSums(cpm(rnaseqMatrix)>=1)>=6
```
5. Apply the filter to the expression matrix:
```{r,echo=TRUE}
cpm_filtered_matrix=rnaseqMatrix[filter,]
```
This operation filters out ~5000 genes, which reduces our multiple testing burden as well (although limma might not atually try and conduct tests on a subset of these).

## Creating a Digital Gene Expression list object 
To run limma, we need to transform the expression matrix into a DGElist ("digital gene expression list") which is an object class that comes from edgeR

6. Create the DGE object and normalized expression matrix:
```{r,echo=TRUE}
DGE<-DGEList(cpm_filtered_matrix)
```
## Normalization using TMM method

7. Calculate normalization factors and do MDS plot:
```{r,echo=TRUE}
DGE<-calcNormFactors(DGE,method =c("TMM"))
```
## Quick and dirty outlier check

8. Generate an MDS plot
```{r,echo=TRUE}
tempvals<-s2c$temp
plotMDS(DGE,top=500,col=ifelse(tempvals=="low","blue","red"),gene.selection="common")
```

## The vanilla one factor, two treatment DE analysis
A fundamental step of DE analysis is to construct a design matrix that will be used to fit the linear model. In our simple vanilla 2-condition, example we can simply use the temp variable in the s2c table to create a matrix. which represents binary 0/1 encodings of the temp conditions.

9. Create design matrix:
```{r,echo=TRUE}
design_temp=model.matrix(~temp, data=s2c)
design_temp
```
After creating the design matrix object, the standard approach is to next run limma voom on the DGE object, e.g.:
```{r,echo=TRUE,eval=FALSE}
v <- voom(DGE, design=design_temp, plot=TRUE)
```
**However** ... while this works fine under an ideal scenario, it becomes a problem if there is variation in sample quality, or more generally, there is some indication that a subset of samples appear as outliers via MDS, PCA, etc. Particularly for RNA-seq experiments where researchers may only have a few repicates per sample, discarding outlier samples is not feasible because it may lead to few if any biological replciates for some subset of treatments. 

A better solution to this problem is to apply weights to samples such that outlier samples are downweighted during differential expression calcuations. Limma voom does this by calculating "empirical quality weights" for each sample. 

10. Run limma voom with sample quality weights:
```{r,echo=TRUE}
vwts <- voomWithQualityWeights(DGE, design=design_temp,normalize.method="none", plot=TRUE)
```

**Note:** we have already applied TMM normalization, thus can set the normalization argument to none. This above command will also generate a plot with two panels showing the mean-variance relationship fit on the left, and a barplot of weights assigned to individual samples.

11. Then, run the linear model fitting procedure 1st step:
```{r,echo=TRUE}
fit=lmFit(vwts,design_temp)
```
12. Then apply the empirical bayes procedure:
```{r,echo=TRUE}
fit=eBayes(fit,robust=TRUE)
```
We use the robust=TRUE setting to leverage the quality weights such that the analysis is robust to outliers.

One can then get a quick and dirty summary of how many genes are differentially expressed, setting the FDR threshold,where the "fdr" and "BH" methods are synonymous for Benjamini Hochberg adjusted p-values.

13. Get summary table:
```{r,echo=TRUE}
summary(decideTests(fit,adjust.method="fdr",p.value = 0.05))
```
One piece of important info is the factor relative to which logfold change is being calculated, i.e. low will be the numerator for logfold change calculations.

14. Explore the data by extracting the top 10 DE genes (sorted by p-value):
```{r,echo=TRUE}
topTable(fit, adjust="BH",resort.by="P")
```
The full table will be useful for many purposes, such as creating custom MA or volcano plots with color-coding and symbols to meet your needs.
15. Create a table of all genes (significant and not)
```{r,echo=TRUE}
all_genes<-topTable(fit, adjust="BH",coef="templow", p.value=1, number=Inf ,resort.by="P")
```
coeff = the coefficient or contrast you want to extract  
number = the max number of genes to list  
adjust = the P value adjustment method  
resort.by determines what criteria with which to sort the table  

## Analysis of a 2-factor design
Extending limma to analyze more complex designs is relatively straightforward. A key part is to specify the design matrix properly. For the 2-factor design, one would do this as follows:
16. Construct the design matrix to incorporate temperate and population effects
```{r,echo=TRUE}
population <- factor(s2c$population)
temperature <- factor(s2c$temp, levels=c("low","high"))
design_2factor<- model.matrix(~population+temperature)
design_2factor
```
Then, you would proceed with DE analysis in a similar fashion as with the single factor experiment described above.