TAMA GO: ORF and NMD predictions

This is the manual for running the ORF and NMD prediction pipelines.

This manual assumes that you have run TAMA Collapse or TAMA Merge and now have a bed file in their respective formats.

Table of Contents Other software and data required for this pipeline: Pipeline: 1. Converting the bed files into fasta files 2. Getting open read frames (ORF) from the transcript sequences 3. Blasting the amino acid sequences against the Uniprot/Uniref protein database 4. Parsing the Blastp output file for top hits 5. Create new bed file with CDS regions Explanation of Output:

Other software and data required for this pipeline:

1. Bedtools

2. Blastp

3. Uniref database files

Pipeline:

1. Converting the bed files into fasta files

Use Bedtools to convert your bed file into a fasta file.

Example command:

bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}

fasta - this is the fasta file for the genome assembly

bed - this is the bed12 file for the annotation

outfile - this is the name of the output fasta file

2. Getting open read frames (ORF) from the transcript sequences

Use tama_orf_seeker.py to get ORF amino acid sequences.

Example command:

python tama_orf_seeker.py -f ${fasta} -o ${outfile}

fasta - this is the fasta file for your transcript sequences that you produced in the previous step

outfile - the name of the output fasta file which contains the ORF amino acid sequences

3. Blasting the amino acid sequences against the Uniprot/Uniref protein database

Use blastp and a local Uniref database to find protein hits.

Example command:

blastp -evalue 1e-10 -num_threads 16 -db ${uniref} -query ${orf}

uniref - Uniref/Uniprot fasta file

orf - ORF fasta file from previous step

Note: This paramter "-num_threads 16" specifices using 16 threads. Please adjust for your system.

4. Parsing the Blastp output file for top hits

Use tama_orf_blastp_parser.py to parse the results from Blastp.

Example command:

python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile}

blastp - output from Blastp in previous step

outfile - name of output file from parsing

5. Create new bed file with CDS regions

Use tama_cds_regions_bed_add.py to create the annotation bed file with CDS regions added.

Example command:

python tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}

orf - The output file from the previous step

bed - The bed annotation file that you started with in this pipeline

fasta - The transcript fasta file from the first step

outfile - The output bed file with CDS regions

Explanation of Output:

The output of step 5 is a bed file with CDS regions defined by column 7 and 8. Column 7 is the start of the CDS region and column 8 is the end. The 4th column contains the gene ID and transcript ID along with additional information.

This is the format of the 4th column:

gene_id;trans_id;gene_name;trans_length_flag;blastp_match_flag;nmd_flag;frame

This is an example of a line:

  1       219     3261    G1;G1.2;R4GIW7;full_length;50_match;prot_ok;F1  40      +       2154    2662    200,0,255       5       98,93,181,107,714       0,1457,1757,2132,2328

gene_id -> G1

trans_id -> G1.2

gene_name -> R4GIW7

This is the name of the gene in the protein database that had a hit with Blastp.

trans_length_flag -> full_length

Either "full_length" or "5prime_degrade". "5prime_degrade" means that the start codon was not found within the ORF and thus the transcript looks like it is incomplete on the 5' end. "full_length" means that the complete ORF was contained within the transcript.

blastp_match_flag -> 50_match

Possible flags include:

full_match - Complete match of ORF and protein.

90_match - >90% match.

50_match - >50% match.

bad_match - <50% match.

no_hit - No matches.

no_orf - No ORF predicted.

nmd_flag -> prot_ok

Possible flags include:

prot_ok - Looks like a protein coding transcript, not an NMD.

NMD# - So this could be NMD1, NMD2, NMD3, and so forth. NMD indicates that this is an NMD and the number indicates the number of splice junctions between the stop codon and the last exon.

missing - This indicates that there were missing nucleotides in the genome assembly which means ORF prediction could not be done.

frame -> F1

This indicates the frame of the ORF with respect to the start of the transcript.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TAMA GO: ORF and NMD predictions

Table of Contents

Other software and data required for this pipeline:

Pipeline:

1. Converting the bed files into fasta files

2. Getting open read frames (ORF) from the transcript sequences

3. Blasting the amino acid sequences against the Uniprot/Uniref protein database

4. Parsing the Blastp output file for top hits

5. Create new bed file with CDS regions

Explanation of Output:

Clone this wiki locally