-
Notifications
You must be signed in to change notification settings - Fork 26
TAMA GO: ORF and NMD predictions
This is the manual for running the ORF and NMD prediction pipelines.
This manual assumes that you have run TAMA Collapse or TAMA Merge and now have a bed file in their respective formats.
1. Bedtools
2. Blastp
3. Uniref database files
Use Bedtools to convert your bed file into a fasta file.
Example command:
bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}
fasta - this is the fasta file for the genome assembly
bed - this is the bed12 file for the annotation
outfile - this is the name of the output fasta file
The output fasta headers should look like this:
>G1;G1.1::1:29-2665(+) >G1;G1.2::1:219-3261(+) >G1;G1.3::1:1713-3246(+)
If you are using a newer version of bedtools it may look like this:
>G1;G1.1(+) >G1;G1.2(+) >G1;G1.3(+))
Use tama_orf_seeker.py to get ORF amino acid sequences.
Example command:
python tama_orf_seeker.py -f ${fasta} -o ${outfile}
fasta - this is the fasta file for your transcript sequences that you produced in the previous step
outfile - the name of the output fasta file which contains the ORF amino acid sequences
The output fasta file should have headers like this:
>G1;G1.1::1:29-2665(+):F1:0:719:1:239:239:I >G1;G1.1::1:29-2665(+):F1:132:719:45:239:195:M >G1;G1.1::1:29-2665(+):F3:2:64:1:20:20:F
If you used a newer version of bedtools in the first step it may look like this:
>G1;G1.1(+):F1:0:719:1:239:239:I >G1;G1.1(+):F1:132:719:45:239:195:M >G1;G1.1(+):F3:2:64:1:20:20:F
Use blastp and a local Uniref database to find protein hits.
Example command:
blastp -evalue 1e-10 -num_threads 16 -db ${uniref} -query ${orf}
uniref - Uniref/Uniprot fasta file
orf - ORF fasta file from previous step
Note: This paramter "-num_threads 16" specifices using 16 threads. Please adjust for your system.
The individual hit entries in the output file should look like this:
Score = 382 bits (982), Expect = 4e-136, Method: Compositional matrix adjust. Identities = 203/235 (86%), Positives = 209/235 (89%), Gaps = 12/235 (5%) Query 17 KSVLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE 76 + VLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE Sbjct 31 REVLAKKSAPPAPLCPQPGPSLPLSPHTMGAVPHLHGAKGERERRSPSPPREATAREGDE 90 Query 77 ERQSQRGSGCSELRQNRHHVLCVA---VLCILVLALVAVIVLQRLSCLPPPPFSHVW--- 130 ERQSQRGSGCS+LRQNR VLCVA VLCILV +VAVIVLQR SC P PPFSHV+ Sbjct 91 ERQSQRGSGCSKLRQNRRCVLCVALCAVLCILVSVVVAVIVLQRPSCPPRPPFSHVYPNT 150 Query 131 ------KCYYFSYTKSDWNSSREHCHRLGASLATVDTEEEMEFMRGYRRPADRWIGLHRA 184 KCYYFS TKSDW SSRE CHRLGASLATVD EEEMEFMRGYRRPADRWIGLHRA Sbjct 151 WVGFHGKCYYFSDTKSDWKSSREQCHRLGASLATVDREEEMEFMRGYRRPADRWIGLHRA 210 Query 185 EGDEHWTWADGSAFSNWFELRGGGQCAYLHGDWINSTLCHSEKFWVCSRADSYVL 239 EGDEHWTWADGSAFSNWF+ +GGGQC YLHGDWINSTLCHSEKFWVCSRADSYVL Sbjct 211 EGDEHWTWADGSAFSNWFKPQGGGQCVYLHGDWINSTLCHSEKFWVCSRADSYVL 265
Use tama_orf_blastp_parser.py to parse the results from Blastp.
Example command:
python tama_orf_blastp_parser.py -b ${blastp} -o ${outfile}
blastp - output from Blastp in previous step
outfile - name of output file from parsing
Use tama_cds_regions_bed_add.py to create the annotation bed file with CDS regions added.
Example command:
python tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outfile}
orf - The output file from the previous step
bed - The bed annotation file that you started with in this pipeline
fasta - The transcript fasta file from the first step
outfile - The output bed file with CDS regions
The output of step 5 is a bed file with CDS regions defined by column 7 and 8. Column 7 is the start of the CDS region and column 8 is the end. The 4th column contains the gene ID and transcript ID along with additional information.
This is the format of the 4th column:
gene_id;trans_id;gene_name;trans_length_flag;blastp_match_flag;nmd_flag;frame
This is an example of a line:
1 219 3261 G1;G1.2;R4GIW7;full_length;50_match;prot_ok;F1 40 + 2154 2662 200,0,255 5 98,93,181,107,714 0,1457,1757,2132,2328
gene_id -> G1
trans_id -> G1.2
gene_name -> R4GIW7
This is the name of the gene in the protein database that had a hit with Blastp.
trans_length_flag -> full_length
Either "full_length" or "5prime_degrade". "5prime_degrade" means that the start codon was not found within the ORF and thus the transcript looks like it is incomplete on the 5' end. "full_length" means that the complete ORF was contained within the transcript.
blastp_match_flag -> 50_match
Possible flags include:
full_match - Complete match of ORF and protein.
90_match - >90% match.
50_match - >50% match.
bad_match - <50% match.
no_hit - No matches.
no_orf - No ORF predicted.
nmd_flag -> prot_ok
Possible flags include:
prot_ok - Looks like a protein coding transcript, not an NMD.
NMD# - So this could be NMD1, NMD2, NMD3, and so forth. NMD indicates that this is an NMD and the number indicates the number of splice junctions between the stop codon and the last exon.
missing - This indicates that there were missing nucleotides in the genome assembly which means ORF prediction could not be done.
frame -> F1
This indicates the frame of the ORF with respect to the start of the transcript.