TRANSCRIPTOME ASSEMBLY PRACTICE

TRANSCRIPTOME ASSEMBLY PRACTICE

This tutorial demonstrates how to perform a simple transcriptome assembly based on 2 methods:

Mapping-based assembly
De-novo assembly

Dataset

One can find paired-end fastq files artificially created in here.
Data was originated from human sample using hg19/GRCh37 reference genome.
UCSC human genomes related to hg19 can be found here.
Gene annotation file for GRCh37/hg19 can be found here.

Tools

In this practical transcriptome assembly, these tools are utilized for simple and easy implementation.

Tools	Usage	Github
Hisat2	Building an index Alignment	https://github.com/DaehwanKimLab/hisat2
Samtools	Manipulating bam file	https://github.com/samtools/samtools
Cufflinks	Assemble transcript Estimate abundances	https://github.com/cole-trapnell-lab/cufflinks
Velvet	Short read de novo assembler using de Bruijn graphs	https://github.com/dzerbino/velvet
Oases	De novo transcriptome assembler for short reads	https://github.com/dzerbino/oases/

Conda/Miniconda installation

Every tools and their dependencies that are required in this tutorial will be installed through Conda/Miniconda.

Please use this environment.yml file which includes all the tools needed for this practice.

Using this conda command line to generate necessary Conda/Miniconda evnvironment:

conda env create -f environment.yml

If everything is installed correctly, one can check if rna_assembly conda environment exists.

conda env list

Please activate rna_assembly environment before any further steps.

conda activate rna_assembly

Mapping-based assembly method

Building an index using HISAT2
```
hisat2-build ref.fa ref
```

Mapping paired-end reads with reference genome

hisat2 \
    -x ref \
    -1 sample_R1.fastq.gz \
    -2 sample_R2.fastq.gz \
    -S sample.sam

Converting to bam format and sorting by coordinates

samtools view -Sb sample.sam > sample.bam
samtools sort sample.bam -o sample.sorted.bam

Reconstruct full-length transcript sequences based on RNA-seq read mapping using Cufflinks

cufflinks \
    -p 8 \
    --library-type fr-firststrand \
    -G "/path/to/*.gtf" \
    -o "/path/to/outdir/" "sample.sorted.bam"

Output of Cufflinks is given in BED or GTF format which contains the transcript coordinates in a reference sequence.

De novo assembly

Interleave paired-end reads (forward + reverse)

python3 interleave_fastq.py -f "/path/to/sample_R1.fastq" -r "/path/to/sample_R2.fastq" -o "/path/to/outFile"

OR

paste <(cat "/path/to/sample_R1.fastq") <(cat "/path/to/sample_R2.fastq") | paste - - - - | awk 'BEGIN{FS="\t"; OFS="\n"}; {print $1,$3,$5,$7,$2,$4,$6,$8}' > "sample_interleaved.fastq"

Define k-mer length and the output dir (e.g., k-mer=25, outdir="vdir")
```
velveth vdir 25 -fastq -shortPaired sample_interleaved.fastq
```
Graph traversal and contig extraction, insert size is 200 ( One can specify insert size based on your interest and data)
```
velvetg vdir -ins_length 200 -read_trkg yes
```
Oases is applied to the resulting de Bruijn graph
```
oases vdir -ins_length 200 -min_trans_lgth 200
```

Running several assemblies with different k-mer lengths and merge them altogether (Optional)

oases_pipeline.py \
    -m 25 -M 31 \
    -o output_denovo/pairedEnd \
    -d ' -fastq -shortPaired sample_interleaved.fastq ' \
    -p ' -ins_length 200 -min_trans_lgth 200 '

Output directory vdir contains the transcript sequences which are stored in FASTA file transcript.fa. The name of each FASTA entry describes the locus and isoforms. Another file produced by Oases is contig-ordering.txt.

An example of a FASTA entry name is Locus_10_Transcript_1/3_Confidence_0.571_Length_3815.

It indicates that there are three transcripts from locus 10, and this is the first of them.
The confidence value is a number between 0 and 1 (the higher the better), and length is the transcript length in base pairs.

NOTE:

Please remember to specify your own path to each of the data or tool that are mentioned in this practice.
If one encounter errors when running step 5, please replace the original script in conda rna_assembly environment with this modified script. Please also rename it to match the original script's name.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Files

transcriptome_assembly.md

transcriptome_assembly.md

TRANSCRIPTOME ASSEMBLY PRACTICE

Dataset

Tools

Conda/Miniconda installation

Mapping-based assembly method

De novo assembly

Collapse file tree

Files

transcriptome_assembly.md

Latest commit

History

transcriptome_assembly.md

File metadata and controls

TRANSCRIPTOME ASSEMBLY PRACTICE

Dataset

Tools

Conda/Miniconda installation

Mapping-based assembly method

De novo assembly