-
Notifications
You must be signed in to change notification settings - Fork 3
/
README.Rmd
executable file
·152 lines (117 loc) · 5.73 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
---
output: github_document
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# txtools <a href=''><img src='man/figures/logo.png' align="right" height="139" /></a>
<!-- badges: start -->
`r badger::badge_github_version("AngelCampos/txtools", "blue")`
<!-- badges: end -->
## Description
**txtools** is a package that processes RNA-seq reads alignments into
transcriptomic-oriented tables. Enabling a quick and simplified analysis,
to closely inspect summarized RNA-seq data per transcript, at nucleotide
resolution, i.e. coverage, read-starts, read-ends, deletions, and
nucleotide frequency. Attractive plotting is also readily available to
visualize data.
The main processing pipeline of txtools consists of 1) converting genomic
alignments into transcriptomic space and merging paired-end reads using `tx_reads()`
and 2) summarize counts for 'readouts' (coverage, read-start, read-ends,
nucleotide frequency, deletions) along the transcriptome using a function of the `tx_makeDT_` family.
![Figure 1. txtools' main processing pipeline](man/figures/readme_1.png)
Of note, coverage calculation in paired-end reads using txtools considers the
whole range of the paired alignment. From the start of read1 to the end of
read2, including the insert which is not directly sequenced (Figure 2)
![Figure 2. Paired-end reads coverage computation](man/figures/readme_2.png){ width=60% }
## Installation
You can install the development version from
[GitHub](https://github.com/AngelCampos/txtools) typing in the following
commands in the R console:
```{r, installation, eval = F}
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
remotes::install_github("AngelCampos/txtools")
```
## Quick example
In this small example we use the 'Pasilla' experiment data (from its own package)
which contains a BAM file for the paired-end alignments of a *D. melanogaster*
RNA-seq experiment on chromosome 4, along with a FASTA file comprising the
genome sequence for the same chromosome.
Using txtools we can load the **genome** (FASTA), the **gene annotation**
(BED-12), and the **RNA-seq reads alignment** (BAM) files into R with ease.
```{r, results = "hide"}
# Load packages
library(txtools)
library(pasillaBamSubset)
# Getting paths to files
BED_file <- tx_dm3_geneAnnot()
FASTA_file <- pasillaBamSubset::dm3_chr4() # Data from pasillaBamSubset pkg
PE_BAM_file <- pasillaBamSubset::untreated3_chr4() # Data from pasillaBamSubset pkg
# Loading D. melanogaster gene annotation, genome, and paired-end bam alignments
dm3_geneAnnot <- tx_load_bed(BED_file)
dm3_genome <- tx_load_genome(FASTA_file)
dm3_PEreads <- tx_load_bam(file = PE_BAM_file, pairedEnd = TRUE, loadSeq = TRUE)
```
First, we process the alignments to their transcriptomic versions using the
`tx_reads()` function.
```{r}
reads_SE <- tx_reads(reads = dm3_PEreads,
geneAnnot = dm3_geneAnnot,
withSeq = TRUE,
nCores = 1,
minReads = 1)
```
Then we just need to summarize the alignments into a txDT. In this case using
the `tx_makeDT_covNucFreq()` function outputs a table with all the base metrics,
including read coverage ('cov' column), and nucleotide frequency
(A,C,T,G columns).
```{r}
DT <- tx_makeDT_covNucFreq(reads_SE, geneAnnot = dm3_geneAnnot, genome = dm3_genome)
```
The resulting txDT comprise all summarized information from the RNA-seq reads
aligned to the genome and contained within the genes in the gene annotation
(this example consists of only the top 10 expressed genes). For more information on the
columns of DT consult the `tx_makeDT_covNucFreq()` documentation.
To extend the base metrics that the `tx_makeDT_*()`
functions provide txtools provides the `tx_add_*()` functions family. One
example of such functions is `tx_add_diffNucToRefRatio()` which calculates the
ratio of nucleotide counts different to the reference sequence. Using
these metric we can easily spot locations in which RNA transcripts sequence is
different from that of the reference sequence.
```{r}
DT <- tx_add_misincRate(DT, addCounts = TRUE)
DT[which(misincRate > 0.5 & nucTotal > 40),]
```
Finally, using the `tx_plot_nucFreq()`
function we can visualize that data in the DT at an specific location.
```{r, plotNucFreq, fig.width = 6, fig.height = 4}
tx_plot_nucFreq(DT, gene = "NM_079901", txRange = window_around(3803, 15))
```
## Further documentation
- The *txtools* user guide is available as a vignette typing
`vignette("txtools")` at the R console.
- The use cases code is available at its own repo in:
https://github.com/AngelCampos/txtools_uc.
## Current limitations:
* Insertions: txtools is not able to deal with insertions. This is mainly
because insertions are not part of the original trasncriptomic nor genomic
reference space as they would alter the length of the gene model.
This could be an added feature in future versions but is not a priority.
* Potentially long processing times: Loading big BAM files into R commonly
requires a lot of time, having this in mind txtools provides a progress bar to
keep users informed about the loading status. Most importantly, depending on the
ammount of both loaded reads and the size of the *Gene Annotation* tx_reads()
processing time can take several minutes. A solution to this issue is the use of
multi-threadding which has been incorporated into tx_reads() and other
functions, but such functionality is only available for UNIX and MAC OS.
## Additional notes:
* As many R packages meant for high-throughput data analysis and manipulation,
using ***txtools*** may require high ammounts of RAM memory, depending mainly on
the size of BAM files being processed.