Skip to content

Latest commit

 

History

History
152 lines (103 loc) · 5.2 KB

transcriptome_assembly.md

File metadata and controls

152 lines (103 loc) · 5.2 KB

TRANSCRIPTOME ASSEMBLY PRACTICE


This tutorial demonstrates how to perform a simple transcriptome assembly based on 2 methods:

  • Mapping-based assembly
  • De-novo assembly

Dataset

  • One can find paired-end fastq files artificially created in here.

  • Data was originated from human sample using hg19/GRCh37 reference genome.

  • UCSC human genomes related to hg19 can be found here.

  • Gene annotation file for GRCh37/hg19 can be found here.

Tools

In this practical transcriptome assembly, these tools are utilized for simple and easy implementation.

Tools Usage Github
Hisat2 Building an index
Alignment
https://github.com/DaehwanKimLab/hisat2
Samtools Manipulating bam file https://github.com/samtools/samtools
Cufflinks Assemble transcript
Estimate abundances
https://github.com/cole-trapnell-lab/cufflinks
Velvet Short read de novo assembler
using de Bruijn graphs
https://github.com/dzerbino/velvet
Oases De novo transcriptome assembler
for short reads
https://github.com/dzerbino/oases/

Conda/Miniconda installation

Every tools and their dependencies that are required in this tutorial will be installed through Conda/Miniconda.

Please use this environment.yml file which includes all the tools needed for this practice.

Using this conda command line to generate necessary Conda/Miniconda evnvironment:

conda env create -f environment.yml

If everything is installed correctly, one can check if rna_assembly conda environment exists.

conda env list

Please activate rna_assembly environment before any further steps.

conda activate rna_assembly

Mapping-based assembly method

  1. Building an index using HISAT2

    hisat2-build ref.fa ref
  2. Mapping paired-end reads with reference genome

    hisat2 \
        -x ref \
        -1 sample_R1.fastq.gz \
        -2 sample_R2.fastq.gz \
        -S sample.sam
  3. Converting to bam format and sorting by coordinates

    samtools view -Sb sample.sam > sample.bam
    samtools sort sample.bam -o sample.sorted.bam
  4. Reconstruct full-length transcript sequences based on RNA-seq read mapping using Cufflinks

    cufflinks \
        -p 8 \
        --library-type fr-firststrand \
        -G "/path/to/*.gtf" \
        -o "/path/to/outdir/" "sample.sorted.bam"

Output of Cufflinks is given in BED or GTF format which contains the transcript coordinates in a reference sequence.

De novo assembly

  1. Interleave paired-end reads (forward + reverse)

    python3 interleave_fastq.py -f "/path/to/sample_R1.fastq" -r "/path/to/sample_R2.fastq" -o "/path/to/outFile"

    OR

    paste <(cat "/path/to/sample_R1.fastq") <(cat "/path/to/sample_R2.fastq") | paste - - - - | awk 'BEGIN{FS="\t"; OFS="\n"}; {print $1,$3,$5,$7,$2,$4,$6,$8}' > "sample_interleaved.fastq"
  2. Define k-mer length and the output dir (e.g., k-mer=25, outdir="vdir")

    velveth vdir 25 -fastq -shortPaired sample_interleaved.fastq
  3. Graph traversal and contig extraction, insert size is 200 ( One can specify insert size based on your interest and data)

    velvetg vdir -ins_length 200 -read_trkg yes
  4. Oases is applied to the resulting de Bruijn graph

    oases vdir -ins_length 200 -min_trans_lgth 200
  5. Running several assemblies with different k-mer lengths and merge them altogether (Optional)

    oases_pipeline.py \
        -m 25 -M 31 \
        -o output_denovo/pairedEnd \
        -d ' -fastq -shortPaired sample_interleaved.fastq ' \
        -p ' -ins_length 200 -min_trans_lgth 200 '

Output directory vdir contains the transcript sequences which are stored in FASTA file transcript.fa. The name of each FASTA entry describes the locus and isoforms. Another file produced by Oases is contig-ordering.txt.

An example of a FASTA entry name is Locus_10_Transcript_1/3_Confidence_0.571_Length_3815.

  • It indicates that there are three transcripts from locus 10, and this is the first of them.
  • The confidence value is a number between 0 and 1 (the higher the better), and length is the transcript length in base pairs.

NOTE:

  1. Please remember to specify your own path to each of the data or tool that are mentioned in this practice.
  2. If one encounter errors when running step 5, please replace the original script in conda rna_assembly environment with this modified script. Please also rename it to match the original script's name.