Query projection & gene notation

The elementary unit of a TOGA2 annotation is a transcript isoform, internally referred to as a ‘transcript’ in the reference annotation and as a ‘projection’ when annotated in the query.

While the gene-to-isoform correspondence should ideally be known for the reference and provided as input to TOGA2, query genes are inferred by TOGA2 based on ‘same strand, coding exon overlap’: all projections whose coding exons overlap by at least 1 base on the same strand are considered to belong to the same gene.

Projection names

A projection refers to a transcript of a reference gene that corresponds to a locus in the query genome. If you attached the gene symbol or identifier to the transcript identifier (transcriptID#geneID, which is recommended), TOGA2 will add the chain ID to create a projection name as transcriptID#geneID#chainID.

Most projections that are provided in the final TOGA2 output correspond to a reference transcript and its orthologous query locus (inferred via the alignment chain). In the following special cases, TOGA2 will add another suffix to the projection name:

Paralogous projections. If TOGA2 does not classify any projections for a given transcript as orthologous, it can use paralogous projection chains for annotation. These projections are discarded if they overlap orthologous predictions from a different transcript; if such a projection survives to the end of the pipeline, TOGA2 assumes that the query locus contains a real coding gene to which it could not assign an orthology relationship, and includes it in the final annotation. Paralogous projections are indicated by the #paralog suffix: transcriptID#geneID#chainID#paralog
```
chr2	176795719	176811274	NM_001165881.3#ZNF268#22652#paralog	0	+	176795719	176811274	255,160,120	4	31,127,61,2213,	0,11642,11975,13342,
```
Retrogenes. Retrogene candidates are processed pseudogenes that have an intact open reading frame and thus could encode a functional protein (if the locus is transcribed). By default, TOGA2 annotates processed pseudogene predictions (output file processed_pseudogenes.bed), but it only tests for an intact reading frame if the --annotate_processed_pseudogenes flag is set. If this flag is set, CESAR alignment and loss status classification are performed, and retrogene candidates (processed pseudogenes classified as Fully Intact or Intact) are added to the final annotation files (query_annotation.bed, query_annotation.with_utrs.bed, query_genes.tsv/bed, and UCSC BigBed files). Retrogene candidate projections are indicated with the #retro suffix: transcriptID#geneID#chainID#retro
```
chr17	6131853	6132405	ENST00000331825.11#FTL#129374#retro	0	-	6131853	6132405	0,0,100	1	552,	0,
```
Fragmented projections. In the case of fragmented genome assemblies, where a gene is split across several scaffolds, or in the case of chaining artifacts, a reference transcript can have multiple orthologous chains, each aligning different reference exons to different query loci. Projections of such fragmented transcripts have multiple comma-separated chain identifiers in their name (instead of only a single chain identifier). Since fragments are often annotated on different scaffolds or different DNA strands, each fragment gets its own entry in the final BED and BigBed files. To make the name of each fragment unique, a numeric identifier, preceded by a dollar sign, is added to each fragment according to the order of the exons annotated in these fragments: transcriptID#geneID#chainID1,chainID2$fragmentID
```
 chr1	64691390	64691400	XM_011530210.3#DRICH1#239649,25163$1	0	+	64691390	64691400	130,130,130	1	10,	0, ## contains the first projection’s exon
 chr16	16864784	16912108	XM_011530210.3#DRICH1#239649,25163$2	0	-	16864784	16912108	130,130,130	3	188,33,28,	0,17450,47296, ## contains the last projection’s exon
```
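
The naming scheme above can be unpacked mechanically. The following is a minimal sketch and not part of TOGA2: parse_projection is a hypothetical helper name, and it assumes names follow the transcriptID#geneID#chainID layout with an optional #paralog or #retro suffix and an optional $fragmentID. Names without a class suffix are reported as orthologous, and a missing gene ID is reported as "-".

```shell
# Hypothetical helper (not a TOGA2 command): split a projection name into its parts.
parse_projection() {
    printf '%s\n' "$1" | awk '{
        name = $0; fragment = "-"; kind = "ortholog"
        # A trailing $N marks one fragment of a fragmented projection
        if (match(name, /\$[0-9]+$/)) {
            fragment = substr(name, RSTART + 1)
            name = substr(name, 1, RSTART - 1)
        }
        # #paralog and #retro are class suffixes added by TOGA2
        if (name ~ /#paralog$/) { kind = "paralog"; sub(/#paralog$/, "", name) }
        else if (name ~ /#retro$/) { kind = "retro"; sub(/#retro$/, "", name) }
        n = split(name, parts, "#")
        # The gene ID is optional; the chain list is always the last field
        gene = (n >= 3) ? parts[2] : "-"
        printf "transcript=%s gene=%s chains=%s class=%s fragment=%s\n", parts[1], gene, parts[n], kind, fragment
    }'
}
```

For example, parse_projection 'XM_011530210.3#DRICH1#239649,25163$2' reports the two chain identifiers and fragment 2. Single quotes keep the shell from expanding the dollar sign.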

Tip

Among other things, projection postfixes facilitate quick analysis of the results. For example, to get the total number of annotated orthologous transcripts, you can use the following or a similar command:

grep -Fv "#retro" query_annotation.bed | grep -Fv "#paralog" | cut -f4 | cut -d'$' -f1 | sort -u | wc -l

Note that you have to split projection names by the dollar sign (‘$’) to account for potential fragmented projections. Conversely, paralogous and retrogene projections are guaranteed not to be fragmented. To get the number of predicted paralogs or retrogenes, do the following:

grep -Fc "#paralog" query_annotation.bed ## or replace "#paralog" with "#retro"
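
The same counts can be collected in a single pass. This is a sketch under stated assumptions: count_projection_classes is a hypothetical helper name, the input is a tab-separated BED file with the projection name in column 4, and fragments of one projection (names differing only in the $fragmentID part) are counted once.

```shell
# Hypothetical helper: count distinct projections per class in a BED file.
count_projection_classes() {
    cut -f4 "$1" | awk '{
        name = $0
        sub(/\$[0-9]+$/, "", name)          # merge fragments of the same projection
        if (name ~ /#paralog$/) kind = "paralog"
        else if (name ~ /#retro$/) kind = "retro"
        else kind = "ortholog"
        if (!(name in seen)) { seen[name] = 1; count[kind]++ }
    }
    END {
        printf "ortholog\t%d\nparalog\t%d\nretro\t%d\n", count["ortholog"], count["paralog"], count["retro"]
    }'
}
```

Usage: count_projection_classes query_annotation.bed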

Note

Projection names are assigned at the alignment step, meaning that a single transcriptID#geneID#chainID projection identifier corresponds to the same projection in all the alignment step output files (query_annotation.bed, query_genes.tsv, nucleotide.fa(.gz), etc.). Note, however, that #paralog and #retro suffixes are double-checked and fixed at the finalize step. Certain files outside the top output level (e.g., meta/ directory contents) are not revised at this step and therefore do not contain these suffixes. To look up a projection in the meta/ files, strip the additional suffixes first:

echo 'transcriptId#chainId#retro' | sed 's/#retro//g' | zgrep -f- -Fw meta/exon_meta.tsv.gz
echo 'transcriptId#chainId1,chainId2$1' | cut -d'$' -f1 | zgrep -f- -Fw meta/exon_meta.tsv.gz
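
Both cases can be handled by one small function. This is a sketch: meta_projection_name is a hypothetical helper name, and it assumes the only additions to strip are a trailing $fragmentID and a trailing #paralog or #retro suffix.

```shell
# Hypothetical helper: reduce a final projection name to the form used in meta/ files.
meta_projection_name() {
    printf '%s\n' "$1" | sed -e 's/\$[0-9][0-9]*$//' -e 's/#paralog$//' -e 's/#retro$//'
}
```

Its output can be piped into the same lookup as above, for example: meta_projection_name 'transcriptId#chainId#retro' | zgrep -f- -Fw meta/exon_meta.tsv.gz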

Query gene names

Query gene inference

Which transcripts belong to a query gene is inferred from overlapping projections. Two projections that overlap by at least one coding exon base on the same strand are attributed to the same query gene. Importantly, orthologous, paralogous, and retrogene candidate projections cannot be mixed within a single query gene. That means if a paralogous projection overlaps a same-strand orthologous projection by at least one coding base, it is discarded from the final results and is not considered in query gene inference. Likewise, processed pseudogenes (and the subset of retrogene candidates) that overlap orthologous or paralogous predictions are not considered in query gene inference.

Tip

Thus, the gene inference precedence order is: ortholog > paralog > processed_pseudogene; orthologous predictions ‘evict’ paralogs annotated in the same locus, and both orthologs and paralogs cancel the overlapping processed pseudogene predictions.

That means only those paralogous projections and retrogene candidates that remain in the final annotation are considered potential query genes.
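
The overlap rule can be illustrated with a short sketch. It is not TOGA2's implementation: infer_query_genes is a hypothetical helper name, the input is assumed to be a BED12 file from which evicted paralogs and processed pseudogenes have already been removed, coding exons are obtained by clipping each block to the thickStart/thickEnd range, and the reg_ numbering simply follows the sorted order used here. Projections are linked transitively, so two projections that each overlap a different exon of a third one end up in the same gene.

```shell
# Hypothetical helper: cluster BED12 projections by same-strand coding exon overlap.
infer_query_genes() {
    tab=$(printf '\t')
    # Step 1: emit one line per coding exon: chrom, strand, start, end, projection
    awk -F'\t' -v OFS='\t' '{
        n = split($11, size, ","); split($12, offset, ",")
        for (i = 1; i <= n; i++) {
            if (size[i] == "") continue              # trailing comma leaves an empty item
            s = $2 + offset[i]; e = s + size[i]
            if (s < $7 + 0) s = $7 + 0               # clip the exon to the coding region
            if (e > $8 + 0) e = $8 + 0
            if (s < e) print $1, $6, s, e, $4
        }
    }' "$1" |
    # Step 2: sort exons by chromosome, strand, and start coordinate
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2 -k3,3n |
    # Step 3: sweep over the exons and merge projections with overlapping exons
    awk -F'\t' -v OFS='\t' '
        function find(x) { while (parent[x] != x) x = parent[x]; return x }
        {
            if (!($5 in parent)) { parent[$5] = $5; order[++total] = $5 }
            if ($1 == chrom && $2 == strand && ($3 + 0) < reach) {
                a = find(anchor); b = find($5)
                if (a != b) parent[b] = a
                if (($4 + 0) > reach) reach = $4 + 0
            } else {
                chrom = $1; strand = $2; anchor = $5; reach = $4 + 0
            }
        }
        END {
            for (i = 1; i <= total; i++) {
                root = find(order[i])
                if (!(root in gene)) gene[root] = "reg_" (++genes)
                print gene[root], order[i]
            }
        }'
}
```

The output has one line per projection, with the inferred gene in the first column and the projection name in the second. A projection that lies entirely within an intron of another projection does not share a coding exon base with it and therefore forms its own gene.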

Query gene naming

Query genes are named in two steps:

At the gene_inference step, identified query genes are assigned a technical name in the reg_${num} notation. If you halt execution at this step or anywhere else prior to the finalize step, genes will keep these names.
At the finalize step, the refined query gene predictions are named after their orthologous reference genes, based on the orthology relationships between reference and query recorded in orthology_classification.tsv.

Important

Before version 2.0.5, TOGA2 retained reg_${num} names for lost and missing orthologs, paralogs, and processed pseudogenes in the final output files.

Orthologous gene names

Genes inferred from functional (non-lost/missing) orthologous projections get their names based on the results presented in orthology_classification.tsv according to their orthology relationship status.

1:1 orthologs

In the simplest case of single-copy orthologs, the query locus is named after its sole ortholog in the reference:

orthology_classification.tsv:

ENSG00000000003	ENST00000373020.9#TSPAN6	ENSG00000000003	ENST00000373020.9#TSPAN6#11	one2one

query_genes.tsv:

ENSG00000000003	ENST00000373020.9#TSPAN6#11

query_genes.bed:

chrX	133892697	133898294	ENSG00000000003	0	-

many:1 orthologs

Names for many:1 loci reflect their multiple orthologs in the reference:

For genes with two or three orthologs, the query gene name is a comma-separated list of reference gene names.
For genes with more than three orthologs, the name of a random ortholog followed by a plus symbol (‘+’) is used as the query gene name.

orthology_classification.tsv:

## 2-3:one
ENSG00000005075	NM_001371100.1#POLR2J	ENSG00000005075,ENSG00000285437	NM_001371100.1#POLR2J#180	many2one
ENSG00000005075	NM_001393919.1#POLR2J	ENSG00000005075,ENSG00000285437	NM_001393919.1#POLR2J#180	many2one
ENSG00000285437	ENST00000608621.5#POLR2J3	ENSG00000005075,ENSG00000285437	ENST00000608621.5#POLR2J3#1106777	many2one
ENSG00000285437	ENST00000621093.5#POLR2J3	ENSG00000005075,ENSG00000285437	ENST00000621093.5#POLR2J3#1106777	many2one
ENSG00000248333	ENST00000341832.11#CDK11B	ENSG00000248333,ENSG00000008128	ENST00000341832.11#CDK11B#5	many2one
## >3:one
ENSG00000144218	ENST00000672756.2#AFF3	ENSG00000144218+	ENST00000672756.2#AFF3#3	many2one
ENSG00000153107	ENST00000341068.8#ANAPC1	ENSG00000144218+	ENST00000341068.8#ANAPC1#170	many2one
ENSG00000135968	ENST00000309863.11#GCC2	ENSG00000144218+	ENST00000309863.11#GCC2#225	many2one
ENSG00000153201	ENST00000283195.11#RANBP2	ENSG00000144218+	ENST00000283195.11#RANBP2#225	many2one

query_genes.tsv:

## 2-3:one
ENSG00000005075,ENSG00000285437	ENST00000621093.5#POLR2J3#1106777
ENSG00000005075,ENSG00000285437	NM_001371100.1#POLR2J#180
ENSG00000005075,ENSG00000285437	ENST00000608621.5#POLR2J3#1106777
ENSG00000005075,ENSG00000285437	ENST00000292614.10#POLR2J#180
ENSG00000005075,ENSG00000285437	NM_001393919.1#POLR2J#180
## >3:one
ENSG00000144218+	XM_017004841.2#RGPD6#772,918
ENSG00000144218+	XM_017004739.3#RGPD3#13101,750
ENSG00000144218+	XM_047446447.1#GCC2#225
...

query_genes.bed:

## 2-3:one
chr5	136116752	136122777	ENSG00000005075,ENSG00000285437	0	+
## >3:one
chr10	58255565	58493968	ENSG00000144218+	0	+
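
The many:1 rule can be written down in a few lines. This is a sketch with a hypothetical helper name, many_to_one_name. It takes the reference gene names as arguments, keeps them in the order given, and uses the first argument as a stand-in for the randomly chosen ortholog when there are more than three.

```shell
# Hypothetical helper: build a query gene name from its reference orthologs.
many_to_one_name() {
    if [ "$#" -le 3 ]; then
        # One to three orthologs: join all names with commas
        (IFS=,; printf '%s\n' "$*")
    else
        # More than three: one ortholog name followed by a plus sign
        printf '%s+\n' "$1"
    fi
}
```

For example, many_to_one_name ENSG00000005075 ENSG00000285437 yields the comma-separated name shown in the 2-3:one example above.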

1:many orthologs

Multiple copies of an individual gene are marked with a lowercase Latin letter preceded by an underscore (_a, _b, _c, etc.). Letters are assigned in sorted chain identifier order.

orthology_classification.tsv:

ENSG00000002726	ENST00000360937.9#AOC1	ENSG00000002726_a	ENST00000360937.9#AOC1#14	one2many
ENSG00000002726	ENST00000493429.5#AOC1	ENSG00000002726_a	ENST00000493429.5#AOC1#14	one2many
ENSG00000002726	ENST00000360937.9#AOC1	ENSG00000002726_b	ENST00000360937.9#AOC1#12271	one2many

query_genes.tsv:

ENSG00000002726_a	ENST00000493429.5#AOC1#14
ENSG00000002726_a	ENST00000360937.9#AOC1#14
ENSG00000002726_b	ENST00000360937.9#AOC1#12271

query_genes.bed:

chr6	48905191	48908833	ENSG00000002726_a	0	+
chr6	48975142	48977773	ENSG00000002726_b	0	+
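
The lettering can be sketched as follows. The helper name label_query_copies is hypothetical, and the numeric sort of chain identifiers is an assumption that is consistent with the example above (chain 14 gets _a, chain 12271 gets _b). The sketch covers at most 26 copies, since the source does not say what follows _z.

```shell
# Hypothetical helper: assign _a, _b, ... to query copies of one reference gene.
# Usage: label_query_copies GENE CHAIN_ID [CHAIN_ID ...]
label_query_copies() {
    gene=$1
    shift
    # Assumption: chain identifiers are sorted numerically
    printf '%s\n' "$@" | sort -n | awk -v gene="$gene" '{
        letter = substr("abcdefghijklmnopqrstuvwxyz", NR, 1)
        printf "%s_%s\t%s\n", gene, letter, $0
    }'
}
```

Each output line holds the query gene name and the chain identifier it was derived from.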

many:many orthologs

We use the naming logic for many:1 loci (detailed above), with one key difference. Since a reference gene in a many:many orthology clique can correspond to more than one query locus, each occurrence of a reference gene in the query gene names carries a copy number preceded by an underscore. A schematic representation of the gene naming looks as follows:

orthology_classification.tsv:

ENSG00000241149	ENST00000330020.5#ZNF722	ENSG00000241149_1,ENSG00000197472_1,ENSG00000197008_1	ENST00000330020.5#ZNF722#3716	many2many
ENSG00000197472	ENST00000339986.8#ZNF695	ENSG00000241149_1,ENSG00000197472_1,ENSG00000197008_1	ENST00000339986.8#ZNF695#17609	many2many
ENSG00000197008	NM_006524.4#ZNF138	ENSG00000241149_1,ENSG00000197472_1,ENSG00000197008_1	NM_006524.4#ZNF138#104978	many2many
ENSG00000196705	ENST00000311048.11#ZNF431	ENSG00000196705_1+	ENST00000311048.11#ZNF431#17271	many2many
ENSG00000196705	NM_001319124.2#ZNF431	ENSG00000196705_1+	NM_001319124.2#ZNF431#17271	many2many
ENSG00000197008	NM_006524.4#ZNF138	ENSG00000152926_2+	NM_006524.4#ZNF138#37436	many2many
ENSG00000197008	ENST00000307355.12#ZNF138	ENSG00000197008_3	ENST00000307355.12#ZNF138#80958	many2many
ENSG00000197008	NM_006524.4#ZNF138	ENSG00000197008_3	NM_006524.4#ZNF138#80958	many2many
ENSG00000197008	ENST00000307355.12#ZNF138	ENSG00000197008_4	ENST00000307355.12#ZNF138#278677	many2many
ENSG00000197008	ENST00000307355.12#ZNF138	ENSG00000197008_5	ENST00000307355.12#ZNF138#85927	many2many

query_genes.tsv:

ENSG00000241149_3	ENST00000330020.5#ZNF722#18792
ENSG00000241149_3	ENST00000421025.3#ZNF679#65572
ENSG00000241149_3	ENST00000343769.6#ZNF93#6075
ENSG00000184635_1,ENSG00000241149_6	ENST00000330020.5#ZNF722#172738
ENSG00000184635_1,ENSG00000241149_6	ENST00000343769.6#ZNF93#2432
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1	XM_047439655.1#ZNF682#953984
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1	ENST00000397165.7#ZNF682#953984
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1	XM_047439805.1#LOC124904792#12602
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1	ENST00000330020.5#ZNF722#9367
ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1	XM_047416470.1#LOC124900642#332773
...

query_genes.bed:

chr6	131083603	131104342	ENSG00000241149_3,ENSG00000197472_1,ENSG00000197008_5	0	-
chr1	117741369	117802425	ENSG00000184635_1,ENSG00000241149_6	0	+
chr13	67442304	67449476	ENSG00000241149_2,XM_047416470_2,ENSG00000197124_1	0	-
...

Non-intact and non-orthologous genes

These genes obtain their names just like the functional orthologs; however, to discriminate these genes from functional orthologs, their status is listed in the prefix of their name.

Important

Prefixes designate the status of the query gene, not of the individual reference genes whose projections contributed to this gene inference. For example, the name lost_ENSG00000135314,ENSG00000256980 describes a lost query gene annotated based on projections from reference genes ENSG00000135314 and ENSG00000256980, not a lost orthologous copy of ENSG00000135314 plus an intact ortholog of ENSG00000256980.

Lost and missing orthologs

Genes classified as lost or missing are named like the 1:1 and many:1 orthologous projections (detailed above). If a reference gene name appears more than once among the lost/missing query gene names, it gets a _${num} suffix to distinguish it from other instances of the same gene (e.g., lost_ENSG00000115756_5). This prevents duplicate names for lost or missing copies of the same ortholog.

Note

Note that reference genes that are entirely deleted (lost) or completely missing (e.g. the entire gene overlaps an assembly gap) in the query are not annotated. Hence, there is no query gene to be named after them.

To separate them from the intact query genes, genes inferred from the inactivated projections are marked with a prefix, which might differ depending on the highest projection loss status in this locus:

A query gene classified as Missing gets the missing_ prefix in the name.
A query gene classified as Lost gets the lost_ prefix in the name.

Note

By default, TOGA2 considers the following loss categories as functionally intact: Fully Intact (FI), Intact (I), Partially Intact (PI), and Uncertain Loss (UL). This can be changed with the --accepted_loss_symbols flag. For example, if UL is excluded from --accepted_loss_symbols, then a query gene with the highest projection status UL will also get the lost_ prefix.

query_genes.tsv:

lost_ENSG00000115756_5	XM_047444095.1#HPCAL1#27238

query_genes.bed:

chr12	23568523	23573957	lost_ENSG00000115756_5	0	-
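
The prefix choice can be sketched as a small decision function. The helper name gene_prefix is hypothetical. The status symbols M (Missing) and L (Lost) are assumptions, as only FI, I, PI, and UL are spelled out above. Treating every non-accepted status other than M as lost generalizes the UL example and is also an assumption.

```shell
# Hypothetical helper: choose the name prefix for a query gene.
# $1: highest loss status among the gene's projections (e.g. FI, I, PI, UL, L, M)
# $2: comma-separated accepted symbols (defaults to the TOGA2 defaults listed above)
gene_prefix() {
    status=$1
    accepted=${2:-FI,I,PI,UL}
    case ",$accepted," in
        *",$status,"*)
            : ;;                                   # functionally intact: no prefix
        *)
            if [ "$status" = "M" ]; then
                printf 'missing_'
            else
                printf 'lost_'
            fi ;;
    esac
}
```

With the defaults, gene_prefix UL prints nothing, while gene_prefix UL FI,I,PI prints lost_.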

Paralogs

Genes inferred exclusively from paralogous projections get the paralog_ prefix:

query_genes.tsv:

paralog_XM_011546186	XM_011546186.2#LOC124900597#183625#paralog

query_genes.bed:

chr1	12120804	12120959	paralog_XM_011546186	0	-

Retrogenes

Retrogene candidates get the retro_ prefix. Note that retrogenes here imply processed pseudogene projections with loss status of Fully Intact (FI) or Intact (I); processed pseudogenes with other loss statuses are not considered in the gene inference and are annotated in the processed_pseudogenes.bed file instead.

query_genes.tsv:

retro_ENSG00000029993	ENST00000325307.12#HMGB3#13785#retro

query_genes.bed:

chr1	4880048	4880651	retro_ENSG00000029993	0	-

Tip

Just like with projections, you can use gene prefixes for quick results analysis. For example, the following one-liner gets the number of functional orthologous genes:

grep -Ev "^(retro|paralog|lost|missing)_" query_genes.tsv | cut -f1 | sort -u | wc -l
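
A per-class breakdown can be obtained in one pass. This is a sketch: count_gene_classes is a hypothetical helper name, and it assumes a tab-separated query_genes.tsv with the gene name in the first column and no header line, as the one-liner above does.

```shell
# Hypothetical helper: count distinct query genes per class by name prefix.
count_gene_classes() {
    cut -f1 "$1" | sort -u | awk '{
        kind = "ortholog"                          # no prefix: functional ortholog
        if ($0 ~ /^lost_/) kind = "lost"
        else if ($0 ~ /^missing_/) kind = "missing"
        else if ($0 ~ /^paralog_/) kind = "paralog"
        else if ($0 ~ /^retro_/) kind = "retro"
        count[kind]++
    }
    END {
        n = split("ortholog lost missing paralog retro", kinds, " ")
        for (i = 1; i <= n; i++) printf "%s\t%d\n", kinds[i], count[kinds[i]]
    }'
}
```

Usage: count_gene_classes query_genes.tsv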
