This tutorial demonstrates how to perform a simple transcriptome assembly based on 2 methods:
- Mapping-based assembly
- De-novo assembly
-
One can find paired-end fastq files artificially created in here.
-
Data was originated from human sample using hg19/GRCh37 reference genome.
-
UCSC human genomes related to hg19 can be found here.
-
Gene annotation file for GRCh37/hg19 can be found here.
In this practical transcriptome assembly, these tools are utilized for simple and easy implementation.
Tools | Usage | Github |
---|---|---|
Hisat2 | Building an index Alignment |
https://github.com/DaehwanKimLab/hisat2 |
Samtools | Manipulating bam file | https://github.com/samtools/samtools |
Cufflinks | Assemble transcript Estimate abundances |
https://github.com/cole-trapnell-lab/cufflinks |
Velvet | Short read de novo assembler using de Bruijn graphs |
https://github.com/dzerbino/velvet |
Oases | De novo transcriptome assembler for short reads |
https://github.com/dzerbino/oases/ |
Every tools and their dependencies that are required in this tutorial will be installed through Conda/Miniconda.
Please use this environment.yml file which includes all the tools needed for this practice.
Using this conda command line to generate necessary Conda/Miniconda evnvironment:
conda env create -f environment.yml
If everything is installed correctly, one can check if rna_assembly
conda environment exists.
conda env list
Please activate rna_assembly
environment before any further steps.
conda activate rna_assembly
-
Building an index using HISAT2
hisat2-build ref.fa ref
-
Mapping paired-end reads with reference genome
hisat2 \ -x ref \ -1 sample_R1.fastq.gz \ -2 sample_R2.fastq.gz \ -S sample.sam
-
Converting to bam format and sorting by coordinates
samtools view -Sb sample.sam > sample.bam samtools sort sample.bam -o sample.sorted.bam
-
Reconstruct full-length transcript sequences based on RNA-seq read mapping using Cufflinks
cufflinks \ -p 8 \ --library-type fr-firststrand \ -G "/path/to/*.gtf" \ -o "/path/to/outdir/" "sample.sorted.bam"
Output of Cufflinks is given in BED or GTF format which contains the transcript coordinates in a reference sequence.
-
Interleave paired-end reads (forward + reverse)
python3 interleave_fastq.py -f "/path/to/sample_R1.fastq" -r "/path/to/sample_R2.fastq" -o "/path/to/outFile"
OR
paste <(cat "/path/to/sample_R1.fastq") <(cat "/path/to/sample_R2.fastq") | paste - - - - | awk 'BEGIN{FS="\t"; OFS="\n"}; {print $1,$3,$5,$7,$2,$4,$6,$8}' > "sample_interleaved.fastq"
-
Define k-mer length and the output dir (e.g., k-mer=25, outdir="vdir")
velveth vdir 25 -fastq -shortPaired sample_interleaved.fastq
-
Graph traversal and contig extraction, insert size is 200 ( One can specify insert size based on your interest and data)
velvetg vdir -ins_length 200 -read_trkg yes
-
Oases is applied to the resulting de Bruijn graph
oases vdir -ins_length 200 -min_trans_lgth 200
-
Running several assemblies with different k-mer lengths and merge them altogether (Optional)
oases_pipeline.py \ -m 25 -M 31 \ -o output_denovo/pairedEnd \ -d ' -fastq -shortPaired sample_interleaved.fastq ' \ -p ' -ins_length 200 -min_trans_lgth 200 '
Output directory vdir contains the transcript sequences which are stored in FASTA file transcript.fa
. The name of each FASTA entry describes the locus and isoforms. Another file produced by Oases is contig-ordering.txt
.
An example of a FASTA entry name is Locus_10_Transcript_1/3_Confidence_0.571_Length_3815.
- It indicates that there are three transcripts from locus 10, and this is the first of them.
- The confidence value is a number between 0 and 1 (the higher the better), and length is the transcript length in base pairs.
NOTE:
- Please remember to specify your own path to each of the data or tool that are mentioned in this practice.
- If one encounter errors when running step 5, please replace the original script in conda
rna_assembly
environment with this modified script. Please also rename it to match the original script's name.