forked from Bioconductor/BiocWorkshops
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path240_Lee_Plyranges.Rmd
779 lines (618 loc) · 28 KB
/
240_Lee_Plyranges.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
# 240: Fluent genomic data analysis with plyranges
## Instructor names and contact information
* Stuart Lee (lee.s@wehi.edu.au)
* Michael Lawrence (michafla@gmail.com)
## Workshop Description
In this workshop, we will give an overview of how to perform low-level
analyses of genomic data using the grammar of genomic data transformation
defined in the plyranges package. We will cover:
- introduction to GRanges
- overview of the core verbs for arithmetic, restriction, and aggregation of GRanges objects
- performing joins between GRanges objects
- designing pipelines to quickly explore data from AnnotationHub
- reading BAM and other file types as GRanges objects
The workshop will be a computer lab, in which the participants will be able to
ask questions and interact with the instructors.
### Pre-requisites
This workshop is self-contained however familiarity with the following would be useful:
- plyranges [vignette](https://sa-lee.github.io/plyranges/articles/an-introduction.html)
- the GenomicRanges and IRanges packages
- tidyverse approaches to data analysis
### Workshop Participation
Students will work through an Rmarkdown document while the instructors respond
to any questions they have.
### _R_ / _Bioconductor_ packages used
- plyranges
- airway
- AnnotationHub
- GenomicRanges
- IRanges
- S4Vectors
### Time outline
| Activity | Time |
|------------------------------|------|
| Overview of GRanges | 5m |
| The plyranges grammar | 20m |
| I/O and data pipelines | 20m |
## Workshop goals and objectives
### Learning goals
* Understand that GRanges follows tidy data principles
* Apply the plyranges grammar to genomic data analysis
### Learning objectives
* Use AnnotationHub to find and summarise data
* Read files into R as GRanges objects
* Perform coverage analysis
* Build data pipelines for analysis based on GRanges
## Workshop
```{r setup, echo=FALSE}
knitr::opts_chunk$set(message = FALSE,
warning = FALSE,
fig.width = 5,
fig.height = 3,
fig.align = "center",
fig.path = "./Lee_Plyranges/")
```
## Introduction
### What is plyranges?
The plyranges package is a domain specific language (DSL) built on top of the
IRanges and GenomicRanges packages (@Lee2018; @Lawrence2013-wg).
It is designed to quickly and coherently
analyse genomic data in the form of GRanges objects (more on those later!) and
from a wide variety of genomic data file types. For users who are familiar with
the tidyverse, the grammar that plyranges implements will look familiar but with
a few modifications for genomic specific tasks.
### Why use plyranges?
The grammar that plyranges develops is helpful for reasoning
about genomics data analysis, and provides a way of developing
short readable analysis pipelines. We have tried to emphasise consistency
and code readability by following the design principles outlined by @Green1996-qg.
One of the goals of plyranges is to provide an alternative entry point
to analysing genomics data with Bioconductor, especially for R beginners and
R users who are more familiar with the tidyverse approach to data analysis.
As a result, we have de-emphasised the use of more complicated data structures
provided by core Bioconductor packages that are useful for programming with.
### Who is this workshop for?
This workshop is intended for new users of Bioconductor, users who are
interested to learn about grammar based approaches for data analysis,
and users who are interested in learning how to use R to perform analyses
like those available in the command line packages BEDTools (@Quinlan2010-gc).
If that's you, let's begin!
## Setup
To participate in this workshop you'll need to have R >= 3.5 and install
the plyranges, AnnotationHub, and airway Bioconductor 3.7 packages
(@R-AnnotationHub; @R-airway). You can achieve this
by installing the BiocManager package from CRAN, loading it then running the
install command:
```{r, eval = FALSE}
install.packages("BiocManager")
library(BiocManager)
install(c("plyranges", "AnnotationHub", "airway"))
```
## What are GRanges objects?
```{r GRanges, echo = FALSE, fig.cap="An illustration of a GRanges data object for a single sample from an RNA-seq experiment. The core components of the object include a seqname column (representing the chromosome), a ranges column which consists of start and end coordinates for a genomic region, and a strand identifier (either positive, negative, or unstranded). Metadata are included as columns to the right of the dotted line as annotations (gene-id) or range level covariates (score).", out.width="\\textwidth"}
knitr::include_graphics("./Lee_Plyranges/GRanges.png")
```
The plyranges package is built on the core Bioconductor data structure
GRanges. It is very similar to the base R data.frame but with appropriate
semantics for a genomics experiment: it has fixed columns
for the chromosome, start and end coordinates, and the strand, along
with an arbitrary set of additional columns, consisting of
measurements or metadata specific to the data type or experiment
(figure \@ref(fig:GRanges)).
GRanges balances flexibility with formal
constraints, so that it is applicable to virtually any genomic
workflow, while also being semantically rich enough to support
high-level operations on genomic ranges. As a core data structure,
GRanges enables compatibility between plyranges and the rest of
Bioconductor.
Since a GRanges object is similar to a data.frame, we can use plyranges
to construct a GRanges object from a data.frame. We'll start by supposing we
have a data.frame of genes from the yeast genome:
```{r}
library(plyranges, quietly = TRUE)
genes <- data.frame(seqnames = "VI",
start = c(3322, 3030, 1437, 5066, 6426, 836),
end = c(3846, 3338, 2615, 5521, 7565, 1363),
strand = c("-", "-", "-", "+", "+", "+"),
gene_id=c("YFL064C", "YFL065C", "YFL066C",
"YFL063W", "YFL062W", "YFL067W"),
stringsAsFactors = FALSE)
gr <- as_granges(genes)
gr
```
The `as_granges` method takes a data.frame and allows you to quickly convert
it to a GRanges object (and can you can also specify which columns in the
data.frame correspond to the columns in the GRanges).
A GRanges object follows tidy data principles: it is a rectangular
table corresponding to a single biological context (@Wickham2014-jc).
Each row contains
a single observation and each column is a variable describing the
observations. In the example above, each row corresponds to a single gene, and
each column contains information about those genes. As GRanges are tidy,
we have constructed plyranges to follow and extend the grammar in the R package
dplyr.
## The Grammar
Here we provide a quick overview of the functions available in plyranges
and illustrate their use with some toy examples (see
\@ref(appendix) for an overview). In the final section we
provide two worked examples (with exercises) that show you can use
plyranges to explore publicly available genomics data and perform coverage
analysis of BAM files.
### Core verbs
The plyranges grammar is simply a set of verbs that define actions to
be performed on a GRanges (for a complete list see the appendix).
Verbs can be composed together using the pipe
operator, `%>%`, which can be read as 'then'. Here's a simple pipeline: first
we will add two columns, one corresponding to the gene_type and another with
the GC content (which we make up by drawing from a uniform distribution).
Second we will remove genes if they have a width less than 400bp.
```{r}
set.seed(2018-07-28)
gr2 <- gr %>%
mutate(gene_type = "ORF",
gc_content = runif(n())) %>%
filter(width > 400)
gr2
```
The `mutate()` function is used to add columns, here we've added one column
called gene_type where all values are set to "ORF" (standing for open reading
frame) and another called gc_content with random uniform values. The `n()`
operator returns the number of ranges in GRanges object, but can only
be evaluated inside of one of the plyranges verbs.
The `filter()` operation returns ranges if the expression evaluates to TRUE.
Multiple expressions can be composed together and will be evaluated as `&`
```{r}
gr2 %>%
filter(strand == "+", gc_content > 0.5)
# is the same as using `&`
gr2 %>%
filter(strand == "+" & gc_content > 0.5)
# but different from using or '|'
gr2 %>%
filter(strand == "+" | gc_content > 0.5)
```
Now that we have some measurements over our genes, we are most likely
interested in performing the favourite tasks of a biological data scientist:
taking averages and counting. This is achieved with the `summarise()` verb
which will return a DataFrame object (Why is this the case?).
```{r}
gr2 %>%
summarise(avg_gc = mean(gc_content),
n = n())
```
which isn't very exciting when performed without `summarise()`'s best
friend the `group_by()` operator:
```{r}
gr2 %>%
group_by(strand) %>%
summarise(avg_gc = mean(gc_content),
n = n())
```
The `group_by()` operator causes `plyranges` verbs to behave differently.
Instead of acting on all the ranges in a GRanges object, the verbs act within
each group of ranges defined by the values in the grouping column(s).
The `group_by()` operator does not change the appearance the of a GRanges object
(well for the most part):
```{r}
by_strand <- gr2 %>%
group_by(strand)
by_strand
```
Now any verb we apply to our grouped GRanges, acts on each partition:
```{r}
by_strand %>%
filter(n() > 2)
by_strand %>%
mutate(avg_gc_strand = mean(gc_content))
```
To remove grouping use the `ungroup()` verb:
```{r}
by_strand %>%
ungroup()
```
Finally, metadata columns can be selected using the `select()` verb:
```{r}
gr2 %>%
select(gene_id, gene_type)
# is the same as not selecting gc_content
gr2 %>%
select(-gc_content)
# you can also select by metadata column index
gr2 %>%
select(1:2)
```
### Verbs specific to GRanges
We have seen how you can perform restriction and aggregation on GRanges,
but what about specific actions for genomics data analysis, like arithmetic,
nearest neighbours or finding overlaps?
#### Arithmetic
Arithmetic operations transform range coordinates, as defined by their
_start_, _end_ and _width_ columns. These three variables are mutually
dependent so we have to take care when modifying them.
For example, changing the _width_ column needs to change
either the _start_, _end_ or both to preserve integrity of the
GRanges:
```{r}
# by default setting width will fix the starting coordinate
gr %>%
mutate(width = width + 1)
```
We introduce the `anchor_direction()` operator to clarify these
modifications. Supported anchor points include the start `anchor_start()`,
end `anchor_end()` and midpoint `anchor_center()`:
```{r}
gr %>%
anchor_end() %>%
mutate(width = width * 2)
gr %>%
anchor_center() %>%
mutate(width = width * 2)
```
Note that the anchoring modifier introduces an important departure from the
IRanges and GenomicRanges packages: by default we ignore the strandedness of
a GRanges object, and instead we provide verbs that make stranded actions
explicit. In this case of anchoring we provide the `anchor_3p()` and
`anchor_5p()` to perform anchoring on the 3' and 5' ends of a range.
The table in the appendix provides a complete list of arithmetic options
available in plyranges.
#### Genomic aggregation
There are two verbs that can be used to aggregate over nearest neighbours:
`reduce_ranges()` and `disjoin_ranges()`.
The reduce verb merges overlapping and neighbouring ranges:
```{r}
gr %>% reduce_ranges()
```
We could find out which genes are overlapping each other by aggregating
over the gene_id column and storing the result in a List column:
```{r}
gr %>%
reduce_ranges(gene_id = List(gene_id))
```
The disjoin verb takes the union of end points over all ranges, and results
in an expanded range
```{r}
gr %>%
disjoin_ranges(gene_id = List(gene_id))
```
You may have noticed that the resulting range are now unstranded, to take
into account stranded features use the _directed_ prefix.
#### Overlaps
Another important class of operations for genomics data analysis is finding
overlaps or nearest neighbours. Here's where we will
introduce the `plyranges` join operators and the overlap aggregation
operators.
Now let's now suppose we have some additional measurements that we have
obtained from a new experiment on yeast. These measurements are for three different
replicates and represent single nucleotide or insertion deletion intensities
from an array. Our
collaborator has given us three different data.frames with the data but they
all have names inconsistent with the GRanges data structure. Our goal is to
unify these into a single GRanges object with column for each measurement
and column for the sample identifier.
Here's our data:
```{r}
set.seed(66+105+111+99+49+56)
pos <- sample(1:10000, size = 100)
size <- sample(1:3, size = 100, replace = TRUE)
rep1 <- data.frame(chr = "VI",
pos = pos,
size = size,
X = rnorm(100, mean = 2),
Y = rnorm(100, mean = 1))
rep2 <- data.frame(chrom = "VI",
st = pos,
width = size,
X = rnorm(100, mean = 0.5, sd = 3),
Y = rnorm(100, sd = 2))
rep3 <- data.frame(chromosome = "VI",
start = pos,
width = size,
X = rnorm(100, mean = 2, sd = 3),
Y = rnorm(100, mean = 4, sd = 0.5))
```
For each replicate we want to construct a GRanges object:
```{r}
# we can tell as_granges which columns in the data.frame
# are the seqnames, and range coordinates
rep1 <- as_granges(rep1, seqnames = chr, start = pos, width = size)
rep2 <- as_granges(rep2, seqnames = chrom, start = st)
rep3 <- as_granges(rep3, seqnames = chromosome)
```
And to construct our final GRanges object we can bind all our replicates
together:
```{r}
intensities <- bind_ranges(rep1, rep2, rep3, .id = "replicate")
# sort by the starting coordinate
arrange(intensities, start)
```
Now we would like to filter our positions if they overlap one of the genes
we have, one way we could achieve this is with the `filter_by_overlaps()`
operator:
```{r}
olap <- filter_by_overlaps(intensities, gr)
olap
```
Another option would be to perform a join operation. A join
acts on two GRanges objects, a query and a subject. The join operator
retains the metadata from the query and subject
ranges. All join operators generate a set of hits based on overlap or
proximity of ranges and use those hits to merge the two datasets in different
ways (figure \ref{fig:olaps-fig}, provides an overview of the overlap joins).
We can further restrict the matching by whether the query is completely _within_
the subject, and adding the _directed_ suffix ensures that
matching ranges have the same direction (strand).
```{r olaps-fig, echo = FALSE, out.width="\\textwidth", fig.cap="Illustration of the three overlap join operators. Each join takes query and subject range as input (black and light gray rectangles, respectively). An index for the join is computed, returning a Hits object, which contains the indices of where the subject overlaps the query range. This index is used to expand the query ranges by where it was 'hit' by the subject ranges. The join semantics alter what is returned: for an \\textbf{inner} join the query range is returned for each match, for an \\textbf{intersect} the intersection is taken between overlapping ranges, and for a \\textbf{left} join all query ranges are returned even if the subject range does not overlap them. This principle is gnerally applied through the plyranges package for both overlaps and nearest neighbour operations."}
knitr::include_graphics("./Lee_Plyranges/olap.png")
```
Going back to our example, the overlap inner join will return all intensities
that overlap a gene and propagate which gene_id a given intensity belongs to:
```{r}
olap <- join_overlap_inner(intensities, gr)
```
If we wanted to return all intensities regardless of overlap we could use
the left join, and a missing value will be propagated to the gene_id
column if there isn't any overlap.
```{r}
join_overlap_left(intensities, gr)
```
If we are interested in finding how much overlap there is we could use
an intersect join. For example we could compute the fraction of overlap
of the genes with themselves:
```{r}
gr %>%
mutate(gene_length = width) %>%
join_overlap_intersect(gr, suffix = c(".query", ".subject")) %>%
filter(gene_id.query != gene_id.subject) %>%
mutate(folap = width / gene_length)
```
### Exercises
There are of course many more functions available in plyranges but hopefully
the previous examples are sufficient to give you an idea of how to incorporate
plyranges into your genomic workflows.
Here's a few exercises to help you familiar with the verbs:
1. Find the average intensity of the X and Y measurements for each each replicate
over all positions in the intensities object
1. Add a new column to the intensities object that is the distance from
each position to its closest gene (hint IRanges::distance)
1. Find flanking regions downstream of the genes in gr that have width of 8bp
(hint: type: flank_ and see what comes up!)
1. Are any of the intensities positions within the flanking region? (use an
overlap join to find out!)
## Data import and creating pipelines
Here we provide some realistic examples of using plyranges to import genomic
data and perform exploratory analyses.
### Worked example: exploring BigWig files from AnnotationHub
In the workflow of ChIP-seq data analysis, we are often interested in
finding peaks from islands of coverage over a chromosome. Here we will use plyranges
to explore ChiP-seq data from the Human Epigenome Roadmap project @Roadmap-Epigenomics-Consortium2015-pr.
#### Extracting data from AnnotationHub
This data is available on Bioconductor's AnnotationHub. First we construct
an AnnotationHub, and then `query()` for all bigWigFiles related to
the project that correspond to the following conditions:
1. are from methylation marks (H3K4ME in the title)
2. correspond to primary T CD8+ memory cells from peripheral blood
3. correspond to unimputed log10 P-values
First we construct a hub that contains all references to the EpigenomeRoadMap
data and extract the metadata as a data.frame:
```{r}
library(AnnotationHub)
library(magrittr)
ah <- AnnotationHub()
roadmap_hub <- ah %>%
query("EpigenomeRoadMap")
# extract2 is just a call to `[[`
metadata <- ah %>%
query("Metadata") %>%
extract2(names(.))
head(metadata)
```
To find out the name of the sample corresponding to
primary memory T-cells we can filter the data.frame. We extract the
sample ID corresponding to our filter.
```{r}
primary_tcells <- metadata %>%
filter(ANATOMY == "BLOOD") %>%
filter(TYPE == "PrimaryCell") %>%
filter(EDACC_NAME == "CD8_Memory_Primary_Cells") %>%
extract2("EID") %>%
as.character()
primary_tcells
```
Now we can take our roadmap hub and query it based on our other conditions:
```{r}
methylation_files <- roadmap_hub %>%
query("BigWig") %>%
query(primary_tcells) %>%
query("H3K4ME[1-3]") %>%
query("pval.signal")
methylation_files
```
So we'll take the first two entries and download them as BigWigFiles:
```{r}
bw_files <- lapply(c("AH33454", "AH33455"), function(id) ah[[id]])
names(bw_files) <- c("HK34ME1", "HK34ME3")
```
We have our desired BigWig files so now we can we can start analysing them.
#### Reading BigWig files
For this analysis, we will call peaks from islands of scores
over chromosome 10.
First, we extract the genome information from the first BigWig file and filter
to get the range for chromosome 10. This range will be used as a filter when
reading the file.
```{r load-bw}
chr10_ranges <- bw_files %>%
extract2(1L) %>%
get_genome_info() %>%
filter(seqnames == "chr10")
```
Then we read the BigWig file only extracting scores if they overlap chromosome
10. We also add the genome build information to the resulting ranges. This
book-keeping is good practice as it ensures the integrity of any
downstream operations such as finding overlaps.
```{r}
# a function to read our bigwig files
read_chr10_scores <- function(file) {
read_bigwig(file, overlap_ranges = chr10_ranges) %>%
set_genome_info(genome = "hg19")
}
# apply the function to each file
chr10_scores <- lapply(bw_files, read_chr10_scores)
# bind the ranges to a single GRAnges object
# and add a column to identify the ranges by signal type
chr10_scores <- bind_ranges(chr10_scores, .id = "signal_type")
chr10_scores
```
Since we want to call peaks over each signal type, we will create a grouped
GRanges object:
```{r}
chr10_scores_by_signal <- chr10_scores %>%
group_by(signal_type)
```
We can then filter to find the coordinates of the peak containing the maximum
score for each signal.
We can then find a 5000 nt region centred around the maximum position
by anchoring and modifying the width.
```{r max-score-region}
chr10_max_score_region <- chr10_scores_by_signal %>%
filter(score == max(score)) %>%
ungroup() %>%
anchor_center() %>%
mutate(width = 5000)
```
Finally, the overlap inner join is used to restrict the chromosome 10
coverage islands, to the islands that are contained in the 5000nt region that
surrounds the max peak for each signal type.
```{r peak_region}
peak_region <- chr10_scores %>%
join_overlap_inner(chr10_max_score_region) %>%
filter(signal_type.x == signal_type.y)
# the filter ensures overlapping correct methlyation signal
```
#### Exercises
1. Use the `reduce_ranges()` function to find all peaks for each signal type.
2. How could you annotate the scores to find out which genes overlap
each peak found in 1.?
3. Plot a 1000nt window centred around the maximum scores for each signal type
using the `ggbio` or `Gviz` package.
### Worked example: coverage analysis of BAM files
A common quality control check in a genomics workflow is to perform coverage
analysis over features of interest or over the entire genome (again
we see the bioinformatician's love of counting things). Here we use the
airway package to compute coverage histograms and show how you can read BAM
files into memory as GRanges objects with plyranges.
First let's gather all the BAM files available to use in airway (see
`browseVignettes("airway")` for more information about the data and how it
was prepared):
```{r airway_bams}
bfs <- system.file("extdata", package = "airway") %>%
dir(pattern = ".bam",
full.names = TRUE)
# get sample names (everything after the underscore)
names(bfs) <- bfs %>%
basename() %>%
sub("_[^_]+$", "", .)
```
To start let's look at a single BAM file, we can compute the coverage of
the alignments over all contigs in the BAM as follows:
```{r}
first_bam_cvg <- bfs %>%
extract2(1) %>%
compute_coverage()
first_bam_cvg
```
Here the coverage is computed without the entire BAM file being read into
memory. Notice that the score here is the count of the number of alignments
that cover a given range. To compute a coverage histogram, that is the number
of number of bases that have a given coverage score for each contig we can
simply use `summarise()`:
```{r}
first_bam_cvg %>%
group_by(seqnames, score) %>%
summarise(n_bases_covered = sum(width))
```
For RNA-seq experiments we are often interested in splitting up alignments
based on whether the alignment has skipped a region from the reference (
that is, there is an "N" in the cigar string, indicating an intron).
In plyranges this can be achieved with the `chop_by_introns()`, that will
split up an alignment if there's an intron. This results in a grouped ranges
object.
To begin we can read the BAM file using `read_bam()`, this does not read the
entire BAM into memory but instead waits for further functions to be applied
to it.
```{r}
# no index present in directory so we use index = NULL
split_bam <- bfs %>%
extract2(1) %>%
read_bam(index = NULL)
# nothing has been read
split_bam
```
For example we can select elements from the BAM file
to be included in the resulting GRanges to load alignments into memory:
```{r}
split_bam %>%
select(flag)
```
Finally, we can split our alignments using the `chop_by_introns()`:
```{r}
split_bam <- split_bam %>%
chop_by_introns()
split_bam
# look at all the junction reads
split_bam %>%
filter(n() >= 2)
```
#### Exercises
1. Compute the total depth of coverage across all features.
1. How could you compute the proportion of bases covered over an entire genome?
(hint: `get_genome_info` and `S4Vectors::merge`)
1. How could you compute the strand specific genome wide coverage?
1. Create a workflow for computing the strand specific coverage for all BAM
files.
1. For each sample plot total breadth of coverage against the number of bases
covered faceted by each sample name.
## What's next?
The grammar based approach for analysing data provides a consistent approach
for analysing genomics experiments with Bioconductor. We have explored
how plyranges can enable you to perform common analysis tasks required of
a bioinformatician. As plyranges is still a new and growing package, there is
definite room for improvement (and likely bugs), if you have any problems using
it or think things could be easier or clearer please file an issue on
[github](https://github.com/sa-lee/plyranges). Contributions from the Bioconductor
community are also more than welcome!
## Appendix {#appendix}
Category Verb Description
--------------------- ------------------------------- --------------------------------------------------------------------
***summarise()*** aggregate over column(s)
Aggregation *disjoin\_ranges()* aggregate column(s) over the union of end coordinates
*reduce\_ranges()* aggregate column(s) by merging overlapping and neighbouring ranges
***mutate()*** modifies any column
***select()*** select columns
Arithmetic (Unary) ***arrange()*** sort by columns
*stretch()* extend range by fixed amount
*shift\_(direction)* shift coordinates
*flank\_(direction)* generate flanking regions
*%intersection%* row-wise intersection
*%union%* row-wise union
*compute\_coverage* coverage over all ranges
Arithmetic (Binary) *%setdiff%* row-wise set difference
*between()* row-wise gap range
*span()* row-wise spanning range
*join\_overlap\_()* merge by overlapping ranges
*join\_nearest* merge by nearest neighbour ranges
*join\_follow* merge by following ranges
Merging *join\_precedes* merge by preceding ranges
*union\_ranges* range-wise union
*intersect\_ranges* range-wise intersect
*setdiff\_ranges* range-wise set difference
*complement\_ranges* range-wise union
*anchor\_direction()* fix coordinates at direction
Modifier ***group\_by()*** partition by column(s)
*group\_by\_overlaps()* partition by overlaps
***filter()*** subset rows
Restriction *filter\_by\_overlaps()* subset by overlap
*filter\_by\_non\_overlaps()* subset by no overlap
: Overview of the plyranges grammar. The core verbs are briefly
described and categorised into one of: aggregation, unary or binary
arithmetic, merging, modifier, or restriction. A verb is given bold
text if its origin is from the dplyr grammar.