forked from Bioconductor/BiocWorkshops
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path100_Morgan_RBiocForAll.Rmd
752 lines (506 loc) · 31 KB
/
100_Morgan_RBiocForAll.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
# 100: _R_ and _Bioconductor_ for everyone: an introduction
Authors:
Martin Morgan^[Roswell Park Comprehensive Cancer Center, Buffalo, NY],
Lori Shepherd, Nitesh Turaga.
<br/>
Last modified: 14 September 2018
## Overview
### Description
This workshop is intended for those with little or no experience using
_R_ or _Bioconductor_. In the first portion of the workshop, we will
explore the basics of using _RStudio_, essential _R_ data types,
composing short scripts and using functions, and installing and using
packages that extend base _R_ functionality. The second portion of the
workshop orients participants to the _Bioconductor_ collection of _R_
packages for analysis and comprehension of high-throughput genomic
data. We will describe how to discover, install, and learn to use
_Bioconductor_ packages, and will explore some of the unique ways in
which _Bioconductor_ represents genomic data. The workshop will
primarily be instructor-led live demos, with participants following
along in their own _RStudio_ sessions.
### Pre-requisites
This workshop is meant to be introductory, and has no pre-requisites.
Participants will maximize benefit from the workshop by pursuing
elementary instructions for using _R_, such as the
[introductory course][] from DataCamp.
[introductory course]: https://www.datacamp.com/courses/free-introduction-to-r
### Participation
Participants will use _RStudio_ interactively as the instructor moves
carefully through short examples of _R_ and _Bioconductor_ code.
### _R_ / _Bioconductor_ packages used
- Base _R_ packages, e.g., `stats`, `graphics`; [ggplot2][]
- Essential _Bioconductor_ packages, e.g., [rtracklayer][], [GenomicRanges][], [SummarizedExperiment][]
[ggplot2]: https://cran.r-project.org/package=ggplot2
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[rtracklayer]: https://bioconductor.org/packages/rtracklayer
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
### Time outline
| Activity | Time |
|:---------------------------------|:-----|
| Introduction to _R_ | 45m |
| - Using _RStudio_ | |
| - _R_ vectors and data frames | |
| - Data input and manipulation | |
| - Scripts and functions | |
| - Using _R_ packages | |
| Introduction to _Bioconductor_ | 60m |
| - Project history | |
| - Discovering and using packages | |
| - Working with objects | |
### Workshop goals and objectives
### Learning goals
Part 1: _R_
- Import text (e.g., 'comma-separate value') files, into _R_.
- Perform manipulations of _R_ data frames.
- Apply _R_ functions for statistical analysis and visualization.
- Use _R_ packages to extend basic functionality.
Part 2: _Bioconductor_
- Find, install, and learn how to use _Bioconductor_ packages.
- Import and manipulate genomic files and _Bioconductor_ data objects.
- Start an RNA-seq differential expression work flow.
### Learning objectives
Part 1: _R_
- Import the 'pData.csv' file describing samples from an experiment.
- 'Clean' the pData to include only a subset of values.
- Perform a t-test assessing the relationship between 'age' and
'mol.biol' (presence of BCR/ABL) amongst samples.
- Visualize the relationship between 'age' and 'mol.biol' using base
_R_'s `boxplot()` function.
- Load the [ggplot2][] package.
- Visualize the relationship between two variables using `ggplot()`.
Part 2: _Bioconductor_
- Discover, install, and read the vignette of the [DESeq2][] package.
- Discover the 'single cell sequencing' vignette
- Import BED and GTF files into _Bioconductor_
- Find regions of overlap between the BED and GTF files.
- Import a matrix and data.frame into _Bioconductor_'s
`SummarizedExperiment` object for RNA-Seq differential expression.
[DESeq2]: https://bioconductor.org/packages/DESeq2
## Introduction to _R_
### _RStudio_ orientation

### Very basics
_R_ works by typing into the console. A very simple example is to add the numbers 1 and 2.
```{r}
1 + 2
```
_R_ works best on _vectors_, and we can construct vectors 'by hand' using the function `c()` (perhaps for _c_oncatenate). Here is a vector of the numbers 1, 3, and 5.
```{r}
c(1, 3, 5)
```
_R_ has many shortcuts. A very commonly used shortcut is to create a vector representing a sequence of numbers using `:`. For instance, here is a vector representing the numbers 1 through 5.
```{r}
1:5
```
It would be pretty tedious to continually have to enter vectors by hand. _R_ allows objects such as the vector returned by `c()` to be assigned to variables. The variables can then be referenced at subsequent points in the script.
```{r}
x <- c(1, 3, 5)
x
```
Since _R_ knows about vectors, it can be very easy to transform all elements of the vector, e.g., taking the `log()` of `x`;
```{r}
log(x)
```
The return value of `log(x)` is itself a vector, and can be assigned to another variable.
```{r}
y <- log(x)
y
```
### Data input and manipulation
**Read data files into _R_ `data.frame` objects**. We start by reading a simple file into _R_. The file is a 'csv' (comma-separated value) file that could have been exported from a spreadsheet. The first few lines of the file include:
```
"Sample","sex","age","mol.biol"
"01005","M",53,"BCR/ABL"
"01010","M",19,"NEG"
"03002","F",52,"BCR/ABL"
"04006","M",38,"ALL1/AF4"
"04007","M",57,"NEG"
```
The file describes samples used in a classic microarray experiment. It consists of four columns:
- `Sample`: a unique identifier
- `sex`: the sex of each patient
- `age`: the age of each patient
- `mol.biol`: cytological characterization of each patient, e.g.,
`"BCR/ABL"` indicates presence of the classic BCR/ABL translocation,
while `"NEG"` indicates no special cytological information.
We start by asking are to find the _path_ to the file, using a function `file.choose()`. This will open a dialog box, and we will navigate to the location of a file named `"ALL-phenoData.csv"`. Print the value of `fname` to view the path that you choose.
```{r, eval = FALSE}
fname <- file.choose()
fname
```
```{r, echo = FALSE}
fname <- "./100_Morgan_RBiocForAll/ALL-phenoData.csv"
```
Confirm that the file exists using the `file.exists()` command.
```{r}
file.exists(fname)
```
Read the data from the csv file using `read.csv()`; assign the input to a variable called `pdata`.
```{r}
pdata <- read.csv(fname)
```
**Viewing the `data.frame`**
We could print the entire data frame to the screen using
```{r, eval = FALSE}
pdata
```
but a smarter way to explore the data is to ask about its dimensions (rows and columns) using `dim()`, to look at the `head()` and `tail()` of the data, and to ask _R_ for a brief `summary()`.
```{r}
dim(pdata) # 128 rows x 4 columns
head(pdata) # first six rows
tail(pdata) # last six rows
summary(pdata)
```
The `age` element of `pdata` is a vector of integers and the summary provides a quantitative description of the data. Some values are missing, and these have the special value `NA`.
The `sex` and `mol.biol` vectors are _factors_. The `sex` variable has two _levels_ (`M`, `F`) while the `mol.biol` vector has 6 levels. A factor is a statistical concept central to describing data; the levels describe the universe of possible values that the variable can take.
`Sample` is also a `factor`, but it should probably be considered a `character` vector used to identify each sample; we clean this up in just a minute.
In addition to `summary()`, it can be helpful to use `class()` to look at the class of each object or column, e.g.,
```{r}
class(fname)
class(pdata)
```
**Reading data, round two**
We noted that `Sample` has been read by _R_ as a factor. Consult the (complicated!) help page for `read.csv()` and try to understand how to use `colClasses` to specify how each column of data should be input.
To navigate to the help page, use
```{r, eval=FALSE}
?read.csv
```
Focus on the `colClasses` argument. It is a vector that describes the class each column should be interpretted as. We'd like to read the columns as character, factor, integer, factor. Start by creating a vector of desired values
```{r}
c("character", "factor", "integer", "factor")
```
Then add another argument to our use of `read.csv()`, specifying the desired column classes as this vector.
```{r}
pdata <- read.csv(
fname,
colClasses = c("character", "factor", "integer", "factor")
)
summary(pdata)
```
**Subsetting**
A basic operation in _R_ is to subset data. A `data.frame` is a two-dimensional object, so it is subset by specifying the desired rows and columns. Several different ways of specifying rows and colums are possible. Row selection often involves numeric vectors, logical vectors, and character vectors (selecting the rows with the corresponding `rownames()`). Column selection is most often with a character vector, but can also be numeric.
```{r}
pdata[1:5, c("sex", "mol.biol")]
pdata[1:5, c(2, 3)]
```
Omitting either the row or column index selects all rows or columns
```{r}
pdata[1:5, ]
```
The subset operator `[` generally returns the same type of object as being subset, e.g., above all return values are `data.frame` objects. It is sometimes desirable to select a single column as a vector. This can be done using `$` or `[[`; note that some values are `NA`, indicating that the patient's age is "not available".
```{r}
pdata$age
pdata[["age"]]
```
We can use `class()` to determine the type of each column
```{r}
class(pdata$age)
```
as well as other functions to manipulate or summarize the data. What do each of the following lines do?
```{r}
table(pdata$mol.biol)
table(is.na(pdata$age))
levels(pdata$sex)
```
**Basic data manipulations**
We've seen (e.g., `log()`, above) that numeric vectors can be transformed by mathematical functions. Similar functions are available for other data types. A very common scenario is to create logical vectors that are then used to subset objects. Here we compare each element of the `sex` column to `"F"`. The result indicates which rows in the `pdata` correspond to indivdudals with sex `"F"`.
```{r}
pdata$sex == "F"
```
A more elaborate example identifies rows corresponding to females greater than 50 years of age. Note that some comparisons are `NA`; why is that?
```{r}
(pdata$sex == "F") & (pdata$age > 50)
```
Another elaborate comparison is `%in%`, which asks whether each element in the vector on the left-hand side of `%in%` is contained in the set of values on the right-hand side, for example, which elements of the `pdata$mol.biol` column are in the set `"BCR/ABL"` or `"NEG"`?
```{r}
table( pdata$mol.biol )
pdata$mol.biol %in% c("BCR/ABL", "NEG")
```
The `subset()` function can be used to perform the subset, e.g., selecting all records of females older than 50
```{r}
subset(pdata, sex == "F" & age > 50)
```
The return value of `subset()` can be assigned to a variable, so the below creates a subset of `pdata` corresponding to indivdiuals whose `mol.biol` is either `"BCR/ABL"` or `"NEG"`; the result is assigned to a variable `bcrabl`.
```{r}
bcrabl <- subset(pdata, mol.biol %in% c("BCR/ABL", "NEG"))
dim( bcrabl )
```
Here's a tabular summary of the levels of the `mol.biol` factor in `bcrabl`
```{r}
table(bcrabl$mol.biol)
```
Note that the factor includes levels that are not actually present in the subset. For our work below, we'd like to remove the unused levels. This can be done by calling the `factor()` function
```{r}
factor(bcrabl$mol.biol)
```
We can then update the `mol.biol` column of `bcrabl` by assigning new values to it with `<-`.
```{r}
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
table(bcrabl$mol.biol)
```
**Exploratory data analysis**
A very useful concept in _R_ is the `formula`. This is specified in a form like `lhs ~ rhs`. The left-hand side is typically a response (dependent) variable, and the right-hand side a description of the independent variables that one is interested in. I 'say' a formula like `age ~ mol.biol` as "age as a function of mol.biol".
To get a sense of the use of formulas in _R_, and the ease with which one can explore data, use the `boxplot()` function to visualize the relationship between age as a function of molecular biology.
```{r}
boxplot(age ~ mol.biol, bcrabl)
```
Consult the help page `?boxplot` to figure out how to add "Age" as the y-axis label.
The boxplot suggests that in our sample individuals with BCR/ABL are on average older than individuals classified as NEG. Confirm this using `t.test()`.
```{r}
t.test(age ~ mol.biol, bcrabl)
```
It might be surprising that the `df` (degrees of freedom) are not a round number, but 68.592. This is because by default _R_ performs a t-test that allows for variances to differ between groups. Consult the `?t.test` help page to figure out to perform a simpler form of the t-test, where variances are equal.
### Writing _R_ scripts
### Using packages
_R_ functions are made available in _packages_. All the functions that we have used so far are implemented in packages that are distributed with _R_. There are actually more than 15,000 _R_ packages available for a wide variety of purposes. The general challenge we explore here is how to discover, install, and use one of these packages.
**Discover**
The most common place to find _R_ packages in on the comprehensive _R_ archive network, CRAN.
Visit CRAN at [https://cran.r-project.org][], click on the 'Packages' link, and the table of available packages sorted by name. Find the [ggplot2][] package and peruse the information you are presented with. What does this package do?
The large number of _R_ packages can make finding the 'best' packages for a particular purpose difficult to identify. From [https://cran.r-project.org][], click on the 'Task Views' link and explore a topic interesting to you.
[https://cran.r-project.org]: https://cran.r-project.org
**Install (once only per _R_ installation)**
One useful packages have been identified, they need to be installed into _R_. This only needs to be done once per _R_ installation. The following installs the [ggplot2][] and [tibble][] packages; you DO NOT need to do this command because [ggplot2][] is already installed on your AMI.
```{r, eval = FALSE}
## No need to do this...
install.packages(c("ggplot2", "tibble"))
```
[tibble]: https://cran.r-project.org/package=tibble
**Use**
Packages that are installed on your system are not available for use until they are attached to the current _R_ session. Use `library()` to attach the [ggplot2][] package.
```{r}
library("ggplot2")
```
Look up help on `?ggplot`. Peruse especially the examples at the bottom of the help page. The idea is that one identifies the data set to be used, and the variables with the data to be used as the x and y 'aesthetics'
```{r, eval=FALSE}
ggplot(brcabl, aes(x = mol.biol, y = age))
```
One then adds 'geoms' to the plot, for instance the boxplot geom
```{r}
ggplot(bcrabl, aes(x = mol.biol, y = age)) + geom_boxplot()
```
Use `xlab()` and `ylab()` to enhance the plot with more meaning x and y labels; see the help page `?xlab`, especially the example section, for hints on how to modify the graph.
The grammar of graphics implemented by [ggplot2][] is a very powerful and expressive system for producing appealing visualizations, and is widely used in the _R_ community.
## Introduction to _Bioconductor_
_Bioconductor_ is a collection of more than 1,500 packages for the statistical analysis and comprehension of high-throughput genomic data. Originally developed for microarrays, _Bioconductor_ packages are now used in a wide range of analyses, including bulk and single-cell RNA-seq, ChIP seq, copy number analysis, microarray methylation and classic expression analysis, flow cytometry, and many other domains.
This section of the workshop introduces the essential of _Bioconductor_ package discovery, installation, and use.
### Discovering, installing, and learning how to use _Bioconductor_ packages
**Discovery**
The web site at https://bioconductor.org contains descriptions of all _Bioconductor_ packages, as well as essential reference material for all levels of user.
Packages available in _Bioconductor_ are summarized at [https://bioconductor.org/packages][], also linked from the front page of the web site. The widget on the left summarizes four distinct types of _Bioconductor_ packages
- 'Software' packages implement particular analyses or other functionality, e.g., querying web-based resources or importing common file formats to _Bioconductor_ objects.
- 'Annotation' packages contain data that can be helpful in placing analysis results in context, for example: mapping between gene symbols such as "BRCA1" and Ensembl or Entrez gene identifiers; classifying genes in terms of gene ontology; describing the genomic coordinates of exons, transcripts, and genes; and representing whole genome sequences of common organisms.
- 'Experiment data' packages contain highly curated data sets that can be useful for learning and teaching (e.g., the [airway][] package and data set used in the [DESeq2][] package for the analysis of bulk RNA-seq differential expression) or placing results in context (e.g., the [curatedTCGAData][] package for conveniently accessing TCGA data in a way that allows very smooth integeration into _Bioconductor_ analyses).
- 'Workflow' packages that summarize common work flows, e.g., [simpleSingleCell][] for single-cell RNA-seq expression analysis.
[airway]: https://bioconductor.org/packages/airway
[DESeq2]: https://bioconductor.org/packages/DESeq2
[curatedTCGAData]: https://bioconductor.org/packages/curatedTCGAData
[simpleSingleCell]: https://bioconductor.org/packages/simpleSingleCell
**Installation**
Like CRAN packages, _Bioconductor_ packages need to be installed only once per _R_ installation, and then attached to each session where they are going to be used.
_Bioconductor_ packages are installed slightly differently from CRAN packages. The first step is to install the [BiocManager][] package from CRAN.
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", repos="https://cran.r-project.org")
```
The next step is to install the desired _Bioconductor_ packages. The syntax to install the [rtracklayer][] and [GenomicRanges][] packages is
```{r, eval = FALSE}
BiocManager::install(c("rtracklayer", "GenomicRanges"))
```
_Bioconductor_ packages tend to depend on one another quite alot, so it is important that the correct versions of all packages are installed. Validate your installation (not necessary during the course) with
```{r, eval = FALSE}
BiocManager::valid()
```
A convenient function in [BiocManager][] is `available()`, which accepts a regular expression to find matching packages. The following finds all 'TxDb' packages (describing exon, transcript, and gene coordinates) for _Homo sapiens_.
```{r}
BiocManager::available("TxDb.Hsapiens")
```
[BiocManager]: https://cran.r-project.org/package=BiocManager
**Learning and support**
Each package is linked to a 'landing page', e.g., [DESeq2][] that contains a description of the package, authors, perhaps literature citations where the software is described, and installation instructions.
An important part of _Bioconductor_ packages are 'vignettes' which describe how the package is to be used. Vignettes are linked from package landing pages, and are available from within _R_ using
```{r, eval = FALSE}
browseVignettes("simpleSingleCell")
```
Users can get support on using packages at https://support.bioconductor.org, a question-and-answer style forum where response usually come quickly and often from very knowledgable users or the package developer. There are many additional sources of support, including [course material][] linked from the home page.
[https://bioconductor.org/packages]: https://bioconductor.org/packages
[work flows]: https://bioconductor.org/packages/release/BiocViews.html#___Workflow
[course material]: https://bioconductor.org/help/course-materials
### Working with Genomic Ranges
This section introduces two useful packages for general-purpose work on genomic coordinates. The [rtracklayer][] package provides the `import()` function to read many types of genomic files (e.g., BED, GTF, VCF, FASTA) into _Bioconductor_ objects. The [GenomicRanges][] package provides functions for manipulating genomic ranges, i.e., descriptions of exons, genes, ChIP peaks, called variants, ... as coordinates in genome space.
Start by attaching the [rtracklayer][] and [GenomicRanges][] packages to our session.
```{r, message=FALSE}
library("rtracklayer")
library("GenomicRanges")
```
**Importing data**
We'll read in a BED file derived from the UCSC genome browser. The file contains the coordinates of all CpG islands in the human genome, and is described at the [UCSC table browser][]. Here are the first few lines of the file, giving the chromosme, start and end coordinates, and identifier of each CpG island.
[UCSC table browser]: https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=578954849_wF1QP81SIHdfr8b0kmZUOcsZcHYr&clade=mammal&org=Human&db=hg38&hgta_group=regulation&hgta_track=knownGene&hgta_table=0&hgta_regionType=genome&position=chr9%3A133252000-133280861&hgta_outputType=primaryTable&hgta_outFileName=
```
chr1 155188536 155192004 CpG:_361
chr1 2226773 2229734 CpG:_366
chr1 36306229 36307408 CpG:_110
chr1 47708822 47710847 CpG:_164
chr1 53737729 53739637 CpG:_221
chr1 144179071 144179313 CpG:_20
```
Use `file.choose()` to find the file
```{r, eval = FALSE}
fname <- file.choose() # CpGislands.Hsapiens.hg38.UCSC.bed
```
```{r, echo = FALSE}
fname <- "100_Morgan_RBiocForAll/CpGislands.Hsapiens.hg38.UCSC.bed"
```
```{r}
fname
file.exists(fname)
```
Then use `import()` from [rtracklayer][] to read the data into _R_. The end result is a `GenomicRanges` object describing each CpG island.
```{r}
cpg <- import(fname)
cpg
```
Closely compare the coordinates of the first few ranges from the file with the first few ranges in the _Bioconductor_ representation. The [BED format][] specification says that coordinates are 0-based, and intervals are half-open (the 'start' coordinate is in the range, the 'end' coordinate is immediately after the range; this makes some computations easy). _Bioconductor_'s convention is that coordinates are 1-based and closed (i.e., both start and end coordinates are included in the range). [rtracklayer][]'s `import()` function has adjusted coordinates to follow _Bioconductor_ conventions.
[BED format]: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
**Working with genomic ranges**
For convenience and to illustrate functionality, let's work only with the 'standard' chromosomes 1 - 22 autosomal, X, and Y chromosomes. Look up the help page `?keepStandardChromosomes` for an explanation of `pruning.mode=`.
```{r}
cpg <- keepStandardChromosomes(cpg, pruning.mode = "coarse")
cpg
```
There are two parts to a `GenomicRanges` object. The `seqnames` (chromosomes, in the present case), start and end coordinates, and strand are _required_. Additional elements such as `name` in the current example are optional.
Required components are accessed by functions such as `start()`, `end()` and `width()`. Optional components can be accessed using the `$` notation.
```{r}
head( start(cpg) )
head( cpg$name )
```
Use the `width()` accessor function to extract a vector of widths of each CpG island. Transform the values using `log10()`, and visualize the distribution using `hist()`.
```{r}
hist(log10(width(cpg)))
```
Use `subset()` to select the CpG islands on chromosomes 1 and 2.
```{r}
subset(cpg, seqnames %in% c("chr1", "chr2"))
```
**Genomic annotations**
Earlier we mentioned 'Annotation data' packages. An example is the TxDb.* family of packages. These packages contain information on the genomic coordinates of exons, genes, transcripts, etc. Attach the TxDb package corresponding to the _Homo sapiens_ hg38 genome build using the UCSC 'knownGene' track.
```{r, message=FALSE}
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
```
Extract the coordinates of all transcripts
```{r}
tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx
```
Keep only the standard chromosomes, to work smoothly with our `cpg` object.
```{r}
tx <- keepStandardChromosomes(tx, pruning.mode="coarse")
tx
```
**Overlaps**
A very useful operation is to count overlaps in two distinct genomic ranges objects. The following counts the number of CpG islands that overlap each transcript. Related functions include `findOverlaps()`, `nearest()`, `precede()`, and `follow()`.
```{r}
olaps <- countOverlaps(tx, cpg)
length(olaps) # 1 count per transcript
table(olaps)
```
Calculations such as `countOverlaps()` can be added to the `GRanges` object, tightly coupling derived data with the original annotation.
```{r}
tx$cpgOverlaps <- countOverlaps(tx, cpg)
tx
```
It is then possible to perform coordinated actions, e.g., subsetting the `GRanges` objects for transcripts satisfying particular conditions, in a coordinated fashion where the software does all the book-keeping to makes sure the correct ranges are selected.
```{r}
subset(tx, cpgOverlaps > 10)
````
Can you think of other situations where one might calculate derived values and couple these with `GRanges` or similar objects?
### Working with summarized experimental data
This section introduces another broadly useful package and data structure, the [SummarizedExperiment][] package and `SummarizedExperiment` object.

The `SummarizedExperiment` object has matrix-like properties -- it has two dimensions and can be subset by 'rows' and 'columns'. The `assay()` data of a `SummarizedExperiment` experiment contains one or more matrix-like objects where rows represent features of interest (e.g., genes), columns represent samples, and elements of the matrix represent results of a genomic assay (e.g., counts of reads overlaps genes in each sample of an bulk RNA-seq differential expression assay.
**Object construction**
The `SummarizedExperiment` coordinates assays with (optional) descriptions of rows and columns. We start by reading in a simple `data.frame` describing 8 samples from an RNASeq experiment looking at dexamethasone treatment across 4 human smooth muscle cell lines; use `browseVignettes("airway")` for a more complete description of the experiment and data processing. Read the column data in using `file.choose()` and `read.csv()`.
```{r, eval=FALSE}
fname <- file.choose() # airway_colData.csv
fname
```
```{r, echo=FALSE}
fname <- "100_Morgan_RBiocForAll/airway_colData.csv"
```
We want the first column the the data to be treated as row names (sample identifiers) in the `data.frame`, so `read.csv()` has an extra argument to indicate this.
```{r}
colData <- read.csv(fname, row.names = 1)
colData
```
The data are from the Short Read Archive, and the row names, `SampleName`, `Run`, `Experiment`, `Sampel`, and `BioSample` columns are classifications from the archive. Additional columns include:
- `cell`: the cell line used. There are four cell lines.
- `dex`: whether the sample was untreated, or treated with dexamethasone.
- `albut`: a second treatment, which we ignore
- `avgLength`: the sample-specific average length of the RNAseq reads estimated in the experiment.
**Assay data**
Now import the assay data from the file "airway_counts.csv"
```{r, eval=FALSE}
fname <- file.choose() # airway_counts.csv
fname
```
```{r, echo=FALSE}
fname <- "100_Morgan_RBiocForAll/airway_counts.csv"
```
```{r}
counts <- read.csv(fname, row.names=1)
```
Although the data are read as a `data.frame`, all columns are of the same type (integer-valued) and represent the same attribute; the data is really a `matrix` rather than `data.frame`, so we coerce to matrix using `as.matrix()`.
```{r}
counts <- as.matrix(counts)
```
We see the dimensions and first few rows of the counts matrix
```{r}
dim(counts)
head(counts)
```
It's interesting to think about what the counts mean -- for ENSG00000000003, sample SRR1039508 had 679 reads that overlapped this gene, sample SRR1039509 had 448 reads, etc. Notice that for this gene there seems to be a consistent pattern -- within a cell line, the read counts in the untreated group are always larger than the read counts for the treated group. This and other basic observations from 'looking at' the data motivate many steps in a rigorous RNASeq differential expression analysis.
**Creating a `SummarizedExperiment` object**
We saw earlier that there was considerable value in tightly coupling the count of CpG islands overlapping each transcript with the `GRanges` describing the transcripts. We can anticipate that close coupling of the column data with the assay data will have similar benefits, e.g., reducing the chances of bookkeeping errors as we work with our data.
Attach the [SummarizedExperiment][] library to our _R_ session.
```{r, message=FALSE}
library("SummarizedExperiment")
```
Use the `SummarizedExperiment()` function to coordinate the assay and column data; this function uses row and column names to make sure the correct assay columns are desribed by the correct column data rows.
```{r}
se <- SummarizedExperiment(assay = list(count=counts), colData = colData)
se
```
It is straight-forward to use `subset()` on `SummarizedExperiment` to create subsets of the data in a coordinated way. Remember that a `SummarizedExperiment` is conceptually two-dimensional (matrix-like), and in the example below we are subsetting on the second dimension.
```{r}
subset(se, , dex == "trt")
```
As with `GRanges`, there are accessors that extract data from the `SummarizedExperiment`. For instance, we can use `assay()` to extract the count matrix, and `colSums()` to calculate the library size (total number of reads overlapping genes in each sample).
```{r}
colSums(assay(se))
```
Note that library sizes differ by a factor of 2 from largest to smallest; how would this influence the interpretation of counts in individual cells of the assay data?
As with `GRanges`, it might be useful to remember important computations in a way that is robust, e.g.,
```{r}
se$lib.size <- colSums(assay(se))
colData(se)
```
### Down-stream analysis
In this final section we quickly hint at down-stream analysis, and the way in which skills learned in working with _Bioconductor_ objects in one package translate to working with objects in other packages.
We start by loading the [DESeq2][] package, a very popular facility for analysing bulk RNAseq differential experssion data.
```{r, message=FALSE}
library("DESeq2")
```
The package requires count data like that in the `SummarizedExperiment` we have been working with, in addition to a `formula` describing the experimental design. Some of the observations above suggest that we should include cell line as a covariate, and dexamethazone treatment as the main factor that we are interested in.
```{r}
dds <- DESeqDataSet(se, design = ~ cell + dex)
dds
```
The `dds` object can be manipulated very much like a `SummarizedExperiment`.
The essention DESeq work flow is summarized by a single function call, which performs advanced statistical analysis on the data in the `dds` object.
```{r}
dds <- DESeq(dds)
```
A table summarizing measures of differential expression can be extracted from the object, and visualized or manipulated using commands we learned earlier today.
```{r}
results(dds)
```
## Summary