-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathde-novo-transcript-assembly.Rmd
113 lines (80 loc) · 4.14 KB
/
de-novo-transcript-assembly.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
---
title: "Assembly and quantification of _de novo_ transcripts"
author: "Sam Buckberry"
date: "02/11/2021"
output: github_document
---
For this section, you will need to have HISAT2 and Stringtie installed and in your PATH to be accessed by the `system()` call from R.
You can build your own alignment index by following the documentation for `hisat2-build` or download the pre-complied index to save a lot of time from [https://genome-idx.s3.amazonaws.com/hisat/grch38_tran.tar.gz](https://genome-idx.s3.amazonaws.com/hisat/grch38_tran.tar.gz)
Load libraries
```{r, eval=FALSE}
library(magrittr)
library(stringr)
library(Rsamtools)
```
Construct a manifest file for mapping all the samples
```{r, eval=FALSE}
r1_list <- list.files("fastq/", pattern = "_filt_fastp_R1.fastq.gz",
full.names = TRUE)
r2_list <- list.files("fastq/", pattern = "_filt_fastp_R2.fastq.gz",
full.names = TRUE)
sam_list <- basename(r1_list) %>%
str_replace("_filt_fastp_R1.fastq.gz", "_hisat2.sam")
dir.create("aligned_data")
sam_list <- str_c("aligned_data/", sam_list)
manifest <- data.frame(r1 = r1_list, r2 = r2_list, sam = sam_list)
write.table(x = manifest, file = "hisat_map_manifest.tsv",
sep = "\t", quote = FALSE, col.names = FALSE,
row.names = FALSE)
```
Align ***trimmed reads*** using HISAT2 with the option `--dta` that is reccomended for the stringtie assembly. For _de novo_ assemblies, removing adapter sequences is much more important that if oyu are just performing differential expression testing.
This code points to the pre-compliled indexes that will be accessible if you are doing this workshop on one of the virtual machines with RStudio.
```{r, eval=FALSE}
align_cmd <- str_c("while read i j k; do hisat2 --time --dta --threads 2 -x /home/data/hisat2/grch38_tran/genome_tran -1 $i -2 $j -S $k; done < hisat_map_manifest.tsv")
system(align_cmd)
```
Convert SAM to BAM, sort and index. Here we will use `mclapply` to speed things up by processing in parallel. All of these samtools functions can be performed in R using Rsamtools.
```{r, eval=FALSE}
# List the hisat2 aligned SAM files
hisat_sam <- list.files(path = "aligned_data/",
pattern = "hisat2.sam$",
full.names = TRUE)
# Convert SAM to BAM
mclapply(hisat_sam, FUN = asBam, mc.cores = 4)
# List the hisat2 aligned BAM files
hisat_bam <- list.files(path = "aligned_data/",
pattern = "hisat2.bam$",
full.names = TRUE)
# Sort the hisat2 BAM files
sorted_bam <- str_replace_all(string = hisat_bam,
pattern = ".bam",
replacement = "_sorted")
mclapply(1:length(hisat_bam),
FUN = function(x){sortBam(hisat_bam[x], sorted_bam[x])},
mc.cores = 4)
# Index the sorted BAM files
mclapply(sorted_bam, indexBam, mc.cores = 4)
```
Make the reference-guided transcript assemblies for each library using Stringtie
```{r, eval=FALSE}
sorted_bam <- list.files("aligned_data/", "_hisat2_sorted.bam$",
full.names = TRUE)
system("pigz -d --keep reference/Homo_sapiens.GRCh38.104.chromosome.22.gtf")
run_stringtie <- function(bam){
gtf <- str_replace(string = bam,
pattern = ".bam",
replacement = ".gtf")
stringtie_cmd <- str_c("stringtie ", bam," -p 12 -G reference/Homo_sapiens.GRCh38.104.chromosome.22.gtf -o ", gtf)
system(stringtie_cmd)
}
mclapply(sorted_bam, run_stringtie, mc.cores = 4)
```
Merge sample assemblies into a meta-assembly of transcripts
```{r, eval=FALSE}
stringtie_gtf <- list.files(path = "aligned_data/",
pattern = ".gtf$",
full.names = TRUE)
stringtie_cmd <- "stringtie aligned_data/*_hisat2_sorted.gtf --merge -G reference/Homo_sapiens.GRCh38.104.chromosome.22.gtf -p 4 -o aligned_data/merged_stringtie_transcripts.gtf"
system(stringtie_cmd)
```
Now you can use the reference guided _de novo_ transcripts GTF with the featureCounts function to get the transcripts counts as in the `map-rna-subread` workflow in this workshop.