-
Notifications
You must be signed in to change notification settings - Fork 26
TAMA GO: ORF and NMD predictions
This is the manual for running the ORF and NMD prediction pipelines.
This manual assumes that you have run TAMA Collapse or TAMA Merge and now have a bed file in their respective formats.
1. Bedtools
2. Blastp
3. Uniref database files
Use Bedtools to convert your bed file into a fasta file.
Example command:
bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}
fasta - this is the fasta file for the genome assembly
bed - this is the bed12 file for the annotation
outfile - this is the name of the output fasta file
Use tama_orf_seeker.py to get ORF amino acid sequences.
Example command:
python tama_orf_seeker.py -f ${fasta} -o ${outfile}
fasta - this is the fasta file for your transcript sequences that you produced in the previous step
outfile - the name of the output fasta file which contains the ORF amino acid sequences
Use blastp and a local Uniref database to find protein hits.
Example command:
blastp -evalue 1e-10 -num_threads 16 -db ${uniref} -query ${orf}
uniref - Uniref/Uniprot fasta file
orf - ORF fasta file from previous step
Note: This paramter "-num_threads 16" specifices using 16 threads. Please adjust for your system.
Use tama_orf_blastp_parser.py to parse the results from Blastp.
Example command:
python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile}
blastp - output from Blastp in previous step
outfile - name of output file from parsing
Use tama_cds_regions_bed_add.py to create the annotation bed file with CDS regions added.
Example command:
python tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}
orf - The output file from the previous step
bed - The bed annotation file that you started with in this pipeline
fasta - The transcript fasta file from the first step
outfile - The output bed file with CDS regions
The output of step 5 is a bed file with CDS regions defined by column 7 and 8. Column 7 is the start of the CDS region and column 8 is the end. The 4th column contains the gene ID and transcript ID along with additional information.
This is the format of the 4th column:
gene_id;trans_id;gene_name;trans_length_flag;blastp_match_flag;nmd_flag;frame
This is an example of a line:
1 219 3261 G1;G1.2;R4GIW7;full_length;50_match;prot_ok;F1 40 + 2154 2662 200,0,255 5 98,93,181,107,714 0,1457,1757,2132,2328
gene_id -> G1
trans_id -> G1.2
gene_name -> R4GIW7
This is the name of the gene in the protein database that had a hit with Blastp.
trans_length_flag -> full_length
Either "full_length" or "5prime_degrade". "5prime_degrade" means that the start codon was not found within the ORF and thus the transcript looks like it is incomplete on the 5' end. "full_length" means that the complete ORF was contained within the transcript.
blastp_match_flag -> 50_match
Possible flags include:
full_match - Complete match of ORF and protein.
90_match - >90% match.
50_match - >50% match.
bad_match - <50% match.
no_hit - No matches.
no_orf - No ORF predicted.
nmd_flag -> prot_ok
Possible flags include:
prot_ok - Looks like a protein coding transcript, not an NMD.
NMD# - So this could be NMD1, NMD2, NMD3, and so forth. NMD indicates that this is an NMD and the number indicates the number of splice junctions between the stop codon and the last exon.
missing - This indicates that there were missing nucleotides in the genome assembly which means ORF prediction could not be done.
frame -> F1
This indicates the frame of the ORF with respect to the start of the transcript.