-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgencode-feature-lengths.Rmd
140 lines (102 loc) · 4.17 KB
/
gencode-feature-lengths.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
---
title: "Micro-first exons in GENCODE 26"
author: "Charles Plessy"
output:
html_document:
keep_md: yes
toc: yes
---
```{r options, message = FALSE}
knitr::opts_chunk$set(cache = TRUE)
```
## The question
I am interested in the length of first exons. To get an estimate of what to expect,
I inspected the contents of GENCODE 26. I am surprised to find very small first
exons, as I do not see how these could be produced by splicing: how could the
spliceosome handle a donor fragment that is only a few nucleotides long ?
So my question is *what do these micro-first exons repesent*. Are they just
bugs in processing pipelines ? Do they represent processed transcripts (in which
case its short length is not in contradiction with splicing mechanisms). Or
is there something else I am missing ?
## Data load and preparation
This analysis is done in R. First, let's load R libraries.
```{r load-libraries, message = FALSE}
library(magrittr)
library(GenomicRanges)
```
Then, let's download GENCODE 26.
```{r download-gencode, engine='bash'}
wget --quiet --continue ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
```
Then, let's load GENCODE 26 in R as a GenomicRanges object and select all
exon features.
```{r load-data}
g <- rtracklayer::import.gff("gencode.v26.annotation.gtf.gz")
summary(g)
colnames(mcols(g))
table(g$type)
ge <- g[g$type == "exon"]
mcols(ge) %<>% droplevels
```
## Exon lengths and number
Many transcripts have more than one exon.
```{r}
tapply(ge$exon_number %>% as.numeric, ge$transcript_id, max) %>% table %>% head(50) %>%
plot(main = "Number of exons per GENCODE transcripts", sub = "Transcripts with > 50 exons were omitted.")
```
Exons tend to be short.
```{r exon-length}
width(ge) %>% summary
width(ge) %>% table %>% head(1000) %>%
plot(main = "Length of GENCODE exons", sub = "Exons longer than 100 nt were omitted.")
```
Some Exons are very short; they can be microexons.
```{r micro-exon-length}
width(ge) %>% table %>% head(20) %>%
plot(main = "Length of shortest GENCODE exons", sub = "Exons longer than 20 nt were omitted.")
```
## First exons of spliced transcripts
Let's focus on spliced transcripts.
```{r spliced-transcripts}
spliced <- tapply(ge$exon_number %>% as.numeric, ge$transcript_id, max) > 1
gs <- ge[ge$transcript_id %in% names(which(spliced))]
summary(gs)
tapply(gs$exon_number %>% as.numeric, gs$transcript_id, max) %>%
table %>% head(50) %>%
plot(main = "Number of exons per spliced transcripts", sub = "Transcripts with > 50 exons were omitted.")
width(gs) %>% table %>% head(20) %>%
plot( main = "Length of shortest GENCODE spliced exons"
, sub = "Exons longer than 20 nt were omitted.")
```
And within them, let's focus on first exons. Altogether, their size distribution
is not very different from all the exons together.
```{r first-exons}
gs1 <- gs[gs$exon_number == 1]
width(gs1) %>% summary
width(gs1) %>% table %>% head(1000) %>%
plot(main = "Length of spliced first exons", sub = "Exons longer than 500 nt were omitted.")
```
Surprisingly, even within first exons, there are microexons.
```{r first-microexons}
width(gs1) %>% table %>% head(20) %>%
plot(main = "Length of spliced first exons", sub = "Exons longer than 20 nt were omitted.")
```
## Examples
Inspection in the UCSC browser suggested that many of these micro-first exons
coincided with translation start sites. Indeed, it seems to happen often, but
not systematically.
```{r examples}
options(width=110)
options("showTailLines"=Inf) # http://stackoverflow.com/questions/16949545/show-all-lines-in-genomicrange-package-output
x <- g[g$transcript_id %in% gs1[width(gs1) < 4]$transcript_id]
# For display purposes, let's give names to each entry, and re-sort the names
# of the metadata's columns.
names(x) <- x$transcript_name
mcols(x) <- mcols(x)[,c("type", "transcript_type", "gene_id",
"source", "gene_type", "transcript_id", "level", "havana_gene",
"transcript_name", "transcript_support_level", "tag", "havana_transcript",
"exon_number", "exon_id", "ont", "protein_id", "ccdsid", "score", "phase",
"gene_name")]
x[x$type %in% c("start_codon", "exon") & x$exon_number == 1]
options("showTailLines"=NULL)
```