-
Notifications
You must be signed in to change notification settings - Fork 297
/
Copy pathrsubread_test.Rmd
213 lines (160 loc) · 4.71 KB
/
rsubread_test.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
---
title: "RNA-seq Pipeline: Rsubread, limma, and edgeR"
output:
html_document:
keep_md: yes
---
## Case study: using a Bioconductor R pipeline to analyze RNA-seq data
This example has been adapted from the work of:
Wei Shi (shi at wehi dot edu dot au), Yang Liao and Gordon K Smyth
Bioinformatics Division, Walter and Eliza Hall Institute, Melbourne, Australia
- Case: http://bioinf.wehi.edu.au/RNAseqCaseStudy/
- Code: http://bioinf.wehi.edu.au/RNAseqCaseStudy/code.R
- Data: http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz
Requirements:
- The version of Rsubread package should be 1.12.1 or later.
- You should run R version 3.0.2 or later.
## Record the start time
```{r}
print(Sys.time())
```
## Setup
> Load libraries, read in names of FASTQ files and create a design matrix.
### Set options
Set global knitr options.
```{r, cache=FALSE}
library("knitr")
opts_chunk$set(tidy=FALSE, cache=FALSE, messages=FALSE)
```
### Install packages
```{r, echo=TRUE, quietly=TRUE}
packages <- c("Rsubread", "limma", "edgeR")
source("http://bioconductor.org/biocLite.R")
for (pkg in packages)
{
require(pkg, character.only = TRUE, quietly = TRUE) || biocLite(pkg)
}
```
### Load libraries
```{r}
library(Rsubread)
library(limma)
library(edgeR)
print(Sys.time())
```
### Download data file
```{r}
# Create the data folder if it does not already exist
datadir <- "./Data/rsubread_test"
dir.create(file.path(datadir), showWarnings=FALSE, recursive=TRUE)
# Enter data folder, first saving location of current folder
projdir <- getwd()
setwd(datadir)
datadir <- getwd()
# Get the file
dataurl <- "http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz"
datafile <- "data.tar.gz"
if (! file.exists(datafile)) {
print("Downloading data file...")
download.file(dataurl, datafile) # 282.3 Mb
}
print(Sys.time())
```
### Extract data file
```{r}
setwd(datadir)
targetsfile <- "Targets.txt"
if (! file.exists(targetsfile)) {
print("Extracting data file...")
untar(datafile, tar="internal")
}
print(Sys.time())
```
### Read target file
```{r}
setwd(datadir)
options(digits=2)
targets <- readTargets(targetsfile)
targets
print(Sys.time())
```
### Create design matrix
```{r}
celltype <- factor(targets$CellType)
design <- model.matrix(~celltype)
print(Sys.time())
```
## Build reference
> Build an index for human chromosome 1. This typically takes ~3 minutes.
Index files with basename 'chr1' will be generated in your current working
directory.
```{r}
setwd(datadir)
buildindex(basename="chr1",reference="hg19_chr1.fa")
print(Sys.time())
```
## Align reads
> Perform read alignment for all four libraries and report uniquely mapped
reads only. This typically takes ~5 minutes. The generated SAM files, which
include the mapping results, will be saved in your current working directory.
```{r}
setwd(datadir)
align(index="chr1", readfile1=targets$InputFile,
input_format="gzFASTQ", output_format="BAM",
output_file=targets$OutputFile,
tieBreakHamming=TRUE, unique=TRUE, indels=5)
print(Sys.time())
```
## Summarize mapped reads
Count numbers of reads mapped to NCBI Refseq genes.
> Summarize mapped reads to RefSeq genes. This will just take a few seconds.
Note that the featureCounts function has built-in annotation for Refseq genes.
featureCounts returns an R 'List' object, which includes raw read count for
each gene in each library and also annotation information for genes such as
gene identifiers and gene lengths.
```{r}
setwd(datadir)
fc <- featureCounts(files=targets$OutputFile,annot.inbuilt="hg19")
x <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
# Return to projdir
setwd(projdir)
print(Sys.time())
```
## Generate RPKM values
Generate RPKM values if you need them.
```{r}
x_rpkm <- rpkm(x,x$genes$Length)
print(Sys.time())
```
## Filter out low-count genes
> Only keep in the analysis those genes which had >10 reads per million mapped
reads in at least two libraries.
```{r}
isexpr <- rowSums(cpm(x) > 10) >= 2
x <- x[isexpr,]
print(Sys.time())
```
## Perform voom normalization
> The figure below shows the mean-variance relationship estimated by voom for
the data.
```{r}
y <- voom(x,design,plot=TRUE)
print(Sys.time())
```
## Cluster libraries
> The following multi-dimensional scaling plot shows that sample A libraries
are clearly separated from sample B libraries.
```{r}
plotMDS(y,xlim=c(-2.5,2.5))
print(Sys.time())
```
## Assess differential expression
Fit linear model and assess differential expression.
> Fit linear models to genes and assess differential expression using the
eBayes moderated t statistic. Here we list top 10 differentially expressed
genes between B vs A.
```{r}
fit <- eBayes(lmFit(y,design))
topTable(fit,coef=2)
print(Sys.time())
```