Query projection & gene notation
The elementary unit of a TOGA2 annotation is a transcript isoform, internally referred to as a ‘transcript’ in the case of the reference annotation and a ‘projection’ in the case of transcripts annotated in the query.
While the gene-to-isoform correspondence should ideally be known for the reference and provided as input to TOGA2, query genes are inferred by TOGA2 based on ‘same strand, coding exon overlap’, meaning that all projections whose coding exons overlap by at least 1 base on the same strand are considered to belong to the same gene.
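The ‘same strand, coding exon overlap’ rule can be illustrated with a short Python sketch. This is not TOGA2's actual implementation: the function name infer_query_genes, the input layout, and the use of half-open (start, end) intervals are assumptions made for illustration only.

```python
from collections import defaultdict


def infer_query_genes(projections):
    """Group projections into genes by same-strand coding-exon overlap.

    `projections` (hypothetical layout) maps a projection name to
    (chrom, strand, exons), where exons is a list of half-open
    (start, end) coding-exon intervals.
    Returns a list of sets of projection names, one set per inferred gene.
    """
    parent = {name: name for name in projections}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Collect exons per (chrom, strand) so only same-strand exons are compared.
    by_locus = defaultdict(list)
    for name, (chrom, strand, exons) in projections.items():
        for start, end in exons:
            by_locus[(chrom, strand)].append((start, end, name))

    for exons in by_locus.values():
        exons.sort()
        # Sweep left to right, remembering the exon that reaches furthest right.
        run_end, run_name = -1, None
        for start, end, name in exons:
            if run_name is not None and start < run_end:
                # At least one shared base: merge the two projections.
                parent[find(name)] = find(run_name)
            if end > run_end:
                run_end, run_name = end, name

    genes = defaultdict(set)
    for name in projections:
        genes[find(name)].add(name)
    return list(genes.values())
```

Projections that only touch end to start, or that sit on opposite strands, stay in separate genes in this sketch, because neither case shares a coding base on the same strand.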
A projection refers to a transcript of a reference gene that corresponds to a locus in the query genome. If you attached the gene symbol or identifier to the transcript identifier (transcriptID#geneID, which is recommended), TOGA2 will add the chain ID to create a projection name as transcriptID#geneID#chainID.
Most projections that are provided in the final TOGA2 output correspond to a reference transcript and its orthologous query locus (inferred via the alignment chain). In the following special cases, TOGA2 will add another suffix to the projection name:
- Paralogous projections. If TOGA2 does not classify any projection of a given transcript as orthologous, it can use paralogous chains for annotation. These projections are discarded if they overlap orthologous predictions from a different transcript; if such a projection survives to the end of the pipeline, TOGA2 assumes that the query locus contains a real coding gene to which it failed to assign an orthology relation, and includes it in the final annotation. Paralogous projections are indicated by the #paralog suffix: transcriptID#geneID#chainID#paralog
chr2 176795719 176811274 NM_001165881.3#ZNF268#22652#paralog 0 + 176795719 176811274 255,160,120 4 31,127,61,2213, 0,11642,11975,13342,
- Retrogenes. Retrogene candidates are processed pseudogenes that have an intact open reading frame and thus could encode a functional protein (if the locus is transcribed). By default, TOGA2 annotates processed pseudogene predictions (output file processed_pseudogenes.bed), but it will only test for an intact reading frame if the --annotate_processed_pseudogenes flag is set. If this flag is set, CESAR alignment and loss status classification are performed, and retrogene candidates (processed pseudogenes classified as Fully Intact or Intact) are added to the final annotation files (query_annotation.bed, query_annotation.with_utrs.bed, query_genes.tsv/bed, and UCSC BigBed files). Retrogene candidate projections are indicated by the #retro suffix: transcriptID#geneID#chainID#retro
chr17 6131853 6132405 ENST00000331825.11#FTL#129374#retro 0 - 6131853 6132405 0,0,100 1 552, 0,
- Fragmented projections. In fragmented genome assemblies, where a gene is split across several scaffolds, or in the presence of chaining artifacts, a reference transcript can have multiple orthologous chains, each aligning different reference exons to different query loci. Projections of such fragmented transcripts have multiple comma-separated chain identifiers in their name (instead of a single chain identifier). Since fragments are often annotated on different scaffolds or different DNA strands, each fragment gets its own entry in the final BED and BigBed files. To make the name of each fragment unique, a numeric identifier is appended to each fragment according to the order of the exons annotated in these fragments, preceded by a dollar sign: transcriptID#geneID#chainID1,chainID2$fragmentID
chr1 64691390 64691400 XM_011530210.3#DRICH1#239649,25163$1 0 + 64691390 64691400 130,130,130 1 10, 0, ## contains the first projection’s exon
chr16 16864784 16912108 XM_011530210.3#DRICH1#239649,25163$2 0 - 16864784 16912108 130,130,130 3 188,33,28, 0,17450,47296, ## contains the last projection’s exon
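The naming scheme described above can be taken apart programmatically. The following Python sketch is only an illustration: the ProjectionName class and the parse_projection_name function are hypothetical helpers, not part of TOGA2, and they assume that neither the transcript ID nor the chain IDs contain ‘#’ or ‘$’.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProjectionName:
    transcript: str
    gene: Optional[str]        # None if no gene ID was attached to the transcript
    chains: Tuple[str, ...]    # more than one chain means a fragmented projection
    suffix: Optional[str]      # "paralog", "retro", or None
    fragment: Optional[int]    # number after '$', or None


def parse_projection_name(name: str) -> ProjectionName:
    # The fragment number, if any, follows the dollar sign.
    base, sep, frag = name.partition("$")
    fragment = int(frag) if sep else None

    parts = base.split("#")
    suffix = parts.pop() if parts[-1] in ("paralog", "retro") else None
    if len(parts) < 2:
        raise ValueError(f"not a projection name: {name!r}")

    # The last remaining field holds one or more comma-separated chain IDs.
    chains = tuple(parts.pop().split(","))
    transcript = parts[0]
    gene = "#".join(parts[1:]) or None
    return ProjectionName(transcript, gene, chains, suffix, fragment)
```

For example, parse_projection_name("XM_011530210.3#DRICH1#239649,25163$2") yields two chain IDs and fragment number 2, while a name ending in #retro has suffix "retro" and no fragment number.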
Tip
Among other things, projection suffixes facilitate quick analysis of the results. For example, to get the total number of annotated orthologous transcripts, you can use the following or a similar command:
grep -Fv "#retro" query_annotation.bed | grep -Fv "#paralog" | cut -f4 | cut -d'$' -f1 | sort -u | wc -l
Note that you have to split projection names by the dollar sign (‘$’) to account for potential fragmented projections. Conversely, paralogous and retrogene projections are guaranteed not to be fragmented. To get the number of predicted paralogs or retrogenes, do the following:
grep -Fc "#paralog" query_annotation.bed ## or replace "#paralog" with "#retro"
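If you prefer to do the same counting in a script, a rough Python equivalent of the one-liners above could look as follows. The function name count_projections and the keys of the returned dictionary are made up for this example; the sketch assumes that the projection name is the fourth whitespace-separated column of each BED line, and it matches the suffixes at the end of the name rather than anywhere in the line.

```python
def count_projections(bed_lines):
    """Count orthologous, paralogous, and retrogene projections in BED lines."""
    orthologous = set()
    paralogous = 0
    retrogene = 0
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        name = fields[3]
        if name.endswith("#paralog"):
            paralogous += 1
        elif name.endswith("#retro"):
            retrogene += 1
        else:
            # Fragments of one projection share the name before the '$'.
            orthologous.add(name.split("$")[0])
    return {
        "orthologous": len(orthologous),
        "paralogous": paralogous,
        "retrogene": retrogene,
    }
```

A typical call would pass an open file handle, e.g. count_projections(open("query_annotation.bed")).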
Note
Projection names are assigned at the alignment step, meaning that a single transcriptID#geneID#chainID projection identifier corresponds to the same projection in all the alignment step output files (query_annotation.bed, query_genes.tsv, nucleotide.fa(.gz), etc.). Note, however, that #paralog and #retro suffixes are double-checked and fixed at the finalize step. Certain files outside the top output level (e.g., meta/ directory contents) are not revised at this step and therefore do not contain these suffixes. To look up such projections in the meta/ files, strip the additional suffixes first (single quotes keep the shell from expanding ‘$1’):
echo 'transcriptId#chainId#retro' | sed 's/#retro//g' | zgrep -f- -Fw meta/exon_meta.tsv.gz
echo 'transcriptId#chainId1,chainId2$1' | cut -d'$' -f1 | zgrep -f- -Fw meta/exon_meta.tsv.gz
Which transcripts belong to a query gene is inferred from overlapping projections. Two projections that overlap by at least one coding exon base on the same strand are attributed to the same query gene. Importantly, orthologous, paralogous, and retrogene candidate projections cannot be mixed within a single query gene. That means if a paralogous projection overlaps a same-strand orthologous projection by at least one coding base, it is discarded from the final results and is not considered in query gene inference. Likewise, processed pseudogenes (and the subset of retrogene candidates) that overlap orthologous or paralogous predictions are not considered in query gene inference.
Tip
Thus, gene inference precedence order is: ortholog > paralog > processed_pseudogene; orthologous predictions ‘evict’ paralogs annotated in the same locus, and both orthologs and paralogs cancel the overlapping processed pseudogene predictions.
That means only those paralogous projections and retrogene candidates that remain in the final annotation are considered potential query genes.
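The precedence order can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TOGA2's code: the dictionary keys, the kind labels, and the apply_precedence function are hypothetical; processed pseudogenes are compared on the same strand, like paralogs; and a lower-ranked projection is only checked against higher-ranked projections that themselves survived.

```python
# Lower rank wins: ortholog > paralog > processed_pseudogene.
RANK = {"ortholog": 0, "paralog": 1, "processed_pseudogene": 2}


def exons_overlap(a, b):
    """True if any half-open (start, end) exon in a shares a base with one in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def apply_precedence(projections):
    """Return names of projections that are not evicted by a higher-ranked one.

    Each projection is a dict with the (hypothetical) keys
    name, kind, chrom, strand, and exons.
    """
    kept = []
    # sorted() is stable, so input order is preserved within each rank.
    for proj in sorted(projections, key=lambda p: RANK[p["kind"]]):
        evicted = any(
            RANK[other["kind"]] < RANK[proj["kind"]]
            and other["chrom"] == proj["chrom"]
            and other["strand"] == proj["strand"]
            and exons_overlap(other["exons"], proj["exons"])
            for other in kept
        )
        if not evicted:
            kept.append(proj)
    return [p["name"] for p in kept]
```

The pairwise comparison is quadratic in the number of projections, which is fine for a demonstration but would need an interval index for genome-scale data.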
Query genes are named in two steps:
- at the gene_inference step, identified query genes are assigned a technical name in the reg_${num} notation. If you halt execution at this step or anywhere else prior to the finalize step, genes will retain these names;
- at the finalize step, the refined query gene predictions are named after their orthologous reference genes, using the orthology relationships between reference and query recorded in orthology_classification.tsv.
Important
Before version 2.0.5, TOGA2 retained reg_${num} names for lost and missing orthologs, paralogs, and processed pseudogenes in the final output files.
Genes inferred from functional (non-lost/missing) orthologous projections get their names based on the results presented in orthology_classification.tsv according to their orthology relationship status.
In the simplest case of single-copy (1:1) orthologs, the query locus is named after its sole ortholog in the reference:
orthology_classification.tsv:
ENSG00000000003 ENST00000373020.9#TSPAN6 ENSG00000000003 ENST00000373020.9#TSPAN6#11 one2one
query_genes.tsv:
ENSG00000000003 ENST00000373020.9#TSPAN6#11
query_genes.bed:
chrX 133892697 133898294 ENSG00000000003 0 -
Names for many:1 loci reflect their multiple orthologs in the reference:
- For genes with two or three orthologs, the query name is a comma-separated list of reference gene names
- For genes with more than three orthologs, a random ortholog’s name followed by a plus symbol (‘+’) is used as the query gene’s name
orthology_classification.tsv:
## 2-3:one
ENSG00000005075 NM_001371100.1#POLR2J ENSG00000005075,ENSG00000285437 NM_001371100.1#POLR2J#180 many2one
ENSG00000005075 NM_001393919.1#POLR2J ENSG00000005075,ENSG00000285437 NM_001393919.1#POLR2J#180 many2one
ENSG00000285437 ENST00000608621.5#POLR2J3 ENSG00000005075,ENSG00000285437 ENST00000608621.5#POLR2J3#1106777 many2one
ENSG00000285437 ENST00000621093.5#POLR2J3 ENSG00000005075,ENSG00000285437 ENST00000621093.5#POLR2J3#1106777 many2one
ENSG00000248333 ENST00000341832.11#CDK11B ENSG00000248333,ENSG00000008128 ENST00000341832.11#CDK11B#5 many2one
## >3:one
ENSG00000144218 ENST00000672756.2#AFF3 ENSG00000144218+ ENST00000672756.2#AFF3#3 many2one
ENSG00000153107 ENST00000341068.8#ANAPC1 ENSG00000144218+ ENST00000341068.8#ANAPC1#170 many2one
ENSG00000135968 ENST00000309863.11#GCC2 ENSG00000144218+ ENST00000309863.11#GCC2#225 many2one
ENSG00000153201 ENST00000283195.11#RANBP2 ENSG00000144218+ ENST00000283195.11#RANBP2#225 many2one
query_genes.tsv:
## 2-3:one
ENSG00000005075,ENSG00000285437 ENST00000621093.5#POLR2J3#1106777
ENSG00000005075,ENSG00000285437 NM_001371100.1#POLR2J#180
ENSG00000005075,ENSG00000285437 ENST00000608621.5#POLR2J3#1106777
ENSG00000005075,ENSG00000285437 ENST00000292614.10#POLR2J#180
ENSG00000005075,ENSG00000285437 NM_001393919.1#POLR2J#180
## >3:one
ENSG00000144218+ XM_017004841.2#RGPD6#772,918
ENSG00000144218+ XM_017004739.3#RGPD3#13101,750
ENSG00000144218+ XM_047446447.1#GCC2#225
...
query_genes.bed:
## 2-3:one
chr5 136116752 136122777 ENSG00000005075,ENSG00000285437 0 +
## >3:one
chr10 58255565 58493968 ENSG00000144218+ 0 +
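The many:1 naming rule illustrated by the examples above can be written down as a small Python function. This is a sketch: many_to_one_name is a hypothetical helper, and where the rule calls for a random ortholog it simply takes the first one in input order so that the result stays reproducible.

```python
def many_to_one_name(reference_genes):
    """Build a query gene name from the reference genes orthologous to it."""
    # Drop duplicates while keeping the input order.
    genes = list(dict.fromkeys(reference_genes))
    if not genes:
        raise ValueError("at least one reference gene is required")
    if len(genes) <= 3:
        # One to three orthologs: list all of them.
        return ",".join(genes)
    # More than three orthologs: one representative name plus '+'.
    # Assumption: the first gene stands in for the random choice.
    return genes[0] + "+"
```

With the two POLR2J genes from the example this returns ENSG00000005075,ENSG00000285437, and with the four genes of the AFF3 group it returns ENSG00000144218+.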
In 1:many loci, multiple query copies of an individual reference gene are marked with a lowercase Latin letter preceded by an underscore (_a, _b, _c, etc.). Letters are assigned following the sorted order of chain identifiers.
orthology_classification.tsv:
ENSG00000002726 ENST00000360937.9#AOC1 ENSG00000002726_a ENST00000360937.9#AOC1#14 one2many
ENSG00000002726 ENST00000493429.5#AOC1 ENSG00000002726_a ENST00000493429.5#AOC1#14 one2many
ENSG00000002726 ENST00000360937.9#AOC1 ENSG00000002726_b ENST00000360937.9#AOC1#12271 one2many
query_genes.tsv:
ENSG00000002726_a ENST00000493429.5#AOC1#14
ENSG00000002726_a ENST00000360937.9#AOC1#14
ENSG00000002726_b ENST00000360937.9#AOC1#12271
query_genes.bed:
chr6 48905191 48908833 ENSG00000002726_a 0 +
chr6 48975142 48977773 ENSG00000002726_b 0 +
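The lettering of gene copies can be sketched as follows. The function label_copies is hypothetical; it assumes that chain identifiers are sorted numerically (which matches the AOC1 example above, where chain 14 gets _a and chain 12271 gets _b) and it does not cover loci with more than 26 copies, since the naming for that case is not described here.

```python
import string


def label_copies(reference_gene, chain_ids):
    """Map each chain ID to a query gene name with a _a, _b, ... suffix."""
    # Assumption: chain IDs are compared as integers, not as strings.
    ordered = sorted(chain_ids, key=int)
    if len(ordered) > len(string.ascii_lowercase):
        raise ValueError("more copies than letters; not covered by this sketch")
    return {
        chain: f"{reference_gene}_{letter}"
        for chain, letter in zip(ordered, string.ascii_lowercase)
    }
```

Calling label_copies("ENSG00000002726", ["12271", "14"]) maps chain 14 to ENSG00000002726_a and chain 12271 to ENSG00000002726_b.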
For many:many loci, TOGA2 uses the naming logic for many:1 loci (detailed above), with one key difference. Since a reference gene in a many:many orthology clique can correspond to more than one query locus, each occurrence of a reference gene in the query gene names carries a copy number preceded by an underscore. A schematic representation of gene naming looks as follows:
[Figure: schematic representation of gene naming in many:many loci]
orthology_classification.tsv:
ENSG00000241149 ENST00000330020.5#ZNF722 ENSG00000241149_1,ENSG00000197472_1,ENSG00000197008_1 ENST00000330020.5#ZNF722#3716 many2many
ENSG00000197472 ENST00000339986.8#ZNF695 ENSG00000241149_1,ENSG00000197472_1,ENSG00000197008_1 ENST00000339986.8#ZNF695#17609 many2many
ENSG00000197008 NM_006524.4#ZNF138 ENSG00000241149_1,ENSG00000197472_1,ENSG00000197008_1 NM_006524.4#ZNF138#104978 many2many
ENSG00000196705 ENST00000311048.11#ZNF431 ENSG00000196705_1+ ENST00000311048.11#ZNF431#17271 many2many
ENSG00000196705 NM_001319124.2#ZNF431 ENSG00000196705_1+ NM_001319124.2#ZNF431#17271 many2many
ENSG00000197008 NM_006524.4#ZNF138 ENSG00000152926_2+ NM_006524.4#ZNF138#37436 many2many
ENSG00000197008 ENST00000307355.12#ZNF138 ENSG00000197008_3 ENST00000307355.12#ZNF138#80958 many2many
ENSG00000197008 NM_006524.4#ZNF138 ENSG00000197008_3 NM_006524.4#ZNF138#80958 many2many
ENSG00000197008 ENST00000307355.12#ZNF138 ENSG00000197008_4 ENST00000307355.12#ZNF138#278677 many2many
ENSG00000197008 ENST00000307355.12#ZNF138 ENSG00000197008_5 ENST00000307355.12#ZNF138#85927 many2many
query_genes.tsv:
ENSG00000241149_3 ENST00000330020.5#ZNF722#18792
ENSG00000241149_3 ENST00000421025.3#ZNF679#65572
ENSG00000241149_3 ENST00000343769.6#ZNF93#6075
ENSG00000184635_1,ENSG00000241149_6 ENST00000330020.5#ZNF722#172738
ENSG00000184635_1,ENSG00000241149_6 ENST00000343769.6#ZNF93#2432
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1 XM_047439655.1#ZNF682#953984
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1 ENST00000397165.7#ZNF682#953984
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1 XM_047439805.1#LOC124904792#12602
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1 ENST00000330020.5#ZNF722#9367
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1 XM_047416470.1#LOC124900642#332773
...
query_genes.bed:
chr6 131083603 131104342 ENSG00000241149_3,ENSG00000197472_1,ENSG00000197008_5 0 -
chr1 117741369 117802425 ENSG00000184635_1,ENSG00000241149_6 0 +
chr13 67442304 67449476 ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1 0 -
...
Genes inferred from inactivated (lost or missing) projections obtain their names just like the functional orthologs; however, to distinguish these genes from functional orthologs, their status is given as a prefix of their name.
Important
Prefixes designate the status of query gene, not the individual reference genes whose projection contributed to this gene inference. For example, a name lost_ENSG00000135314,ENSG00000256980 describes a lost query gene annotated based on projections from reference genes ENSG00000135314 and ENSG00000256980, not a lost orthologous copy ENSG00000135314 + intact ortholog of ENSG00000256980.
Other genes classified as lost or missing are named like the 1:1 and many:1 orthologous projections (detailed above). If a reference gene name appears more than once among the lost/missing query gene names, it gets a _${num} suffix to distinguish it from other instances of the same gene (e.g., lost_ENSG00000115756_5). This prevents duplicate names for lost or missing copies of the same ortholog.
Note
Note that reference genes that are entirely deleted (lost) or completely missing (e.g. the entire gene overlaps an assembly gap) in the query are not annotated. Hence, there is no query gene to be named after them.
To separate them from the intact query genes, genes inferred from inactivated projections are marked with a prefix that depends on the highest projection loss status in the locus:
- A query gene classified as Missing gets the missing_ prefix in the name.
- A query gene classified as Lost gets the lost_ prefix in the name.
Note
Note: By default, TOGA2 considers the following loss categories as functionally intact: Fully Intact (FI), Intact (I), Partially Intact (PI), and Uncertain Loss (UL).
This can be changed with the --accepted_loss_symbols flag. For example, if UL is excluded from --accepted_loss_symbols, then a query gene with the highest projection status UL will also get the lost_ prefix.
query_genes.tsv:
lost_ENSG00000115756_5 XM_047444095.1#HPCAL1#27238
query_genes.bed:
chr12 23568523 23573957 lost_ENSG00000115756_5 0 -
Genes inferred exclusively from paralogous projections get the paralog_ prefix:
query_genes.tsv:
paralog_XM_011546186 XM_011546186.2#LOC124900597#183625#paralog
query_genes.bed:
chr1 12120804 12120959 paralog_XM_011546186 0 -
Retrogene candidates get the retro_ prefix. Note that retrogenes here imply processed pseudogene projections with loss status of Fully Intact (FI) or Intact (I); processed pseudogenes with other loss statuses are not considered in the gene inference and are annotated in the processed_pseudogenes.bed file instead.
query_genes.tsv:
retro_ENSG00000029993 ENST00000325307.12#HMGB3#13785#retro
query_genes.bed:
chr1 4880048 4880651 retro_ENSG00000029993 0 -
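The prefixes and the ‘+’ marker make query gene names easy to dissect in a script. The Python sketch below is illustrative only: split_gene_name is a hypothetical helper, and it deliberately leaves copy suffixes such as _a or _2 attached to each component, because reference identifiers like XM_011546186 contain underscores themselves and cannot be split reliably without the list of reference gene names.

```python
STATUS_PREFIXES = ("lost_", "missing_", "paralog_", "retro_")


def split_gene_name(name):
    """Split a query gene name into (status, components, truncated).

    status     -- "lost", "missing", "paralog", "retro", or None if no prefix
    components -- comma-separated parts of the name, copy suffixes included
    truncated  -- True if the name ends with '+' (more than three orthologs)
    """
    status = None
    for prefix in STATUS_PREFIXES:
        if name.startswith(prefix):
            status = prefix[:-1]
            name = name[len(prefix):]
            break
    truncated = name.endswith("+")
    if truncated:
        name = name[:-1]
    return status, name.split(","), truncated
```

Note that technical reg_${num} names from runs halted before the finalize step carry no status prefix, so this helper reports them with status None as well.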
Tip
Just like with projections, you can use gene prefixes for quick results analysis. For example, the following one-liner gets the number of functional orthologous genes:
grep -Ev "^(retro|paralog|lost|missing)_" query_genes.tsv | cut -f1 | sort -u | wc -l