-
Notifications
You must be signed in to change notification settings - Fork 4
R2Dtool GTF requirements
R2Dtool's annotation, liftover, and downstream plotting functions rely specifically on GTF version 2.2 annotations (also known as revised Ensembl GTFs). R2Dtool uses a custom Rust implementation for parsing GTF2.2 files into transcript and exon data structures, which are used to liftover and annotate RNA feature maps. Where matching annotations are available through Ensembl and an alternative database, we recommend the use of Ensembl annotations, available from the Ensembl FTP site.
GTF features are specified in column 3. For complete functionality, R2Dtool requires on the following feature types being present in the GTF file:
-
transcript
ormRNA
exon
CDS
-
five_prime_utr
or5UTR
-
three_prime_utr
or3UTR
While exon
, transcript
, and UTR
features are mandatory for liftover and annotation purposes, R2Dtool can produce incomplete outputs if CDS
features are missing. This will result in the introduction of NA
outputs for annotation fields where the source data is missing in the GTF file.
GTF attributes are specified in column 9. For complete functionality, R2Dtool requires the following features to be specified for each transcript of interest:
transcript_id
gene_id
gene_name
transcript_biotype*
Note: Transcript biotype can be specified differently depending on GTF source. R2Dtool can accept biotype in one of transcript_biotype
, transcript_type
, or if biotype for a specific transcript is unknown, it is assumed from gene_type
, or gene_biotype
.
transcript_id
is the only attribute required for liftover, but all attributes specified above are necessary for annotation and downstream metaplots.
The transcript_id
specified in the GTF attributes
field must exactly match the reference sequence specified in column 1
of the transcriptome-specified input BED file. It is advisable to use genomes FASTA, transcriptome FASTA, and gene annotation GTF files from the same genome release, or to generate your transcriptome FASTA file, using a matched genome FASTA and GTF file, e.g. using gffread.
In some cases, the FASTA transcript IDs
are appended with a transcript version
, whereas the GTF transcript_id
field doesn't directly contain this information. For example, in the Ensembl 112 GRCh38 cDNA and GTF files:
$ head -n 2 Homo_sapiens.GRCh38.cdna.all.fa
>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
# the FASTA file contains a transcript version ".1" appended to the transcript ID "ENST00000415118"
$ grep "ENST00000415118.1" Homo_sapiens.GRCh38.112.chr_patch_hapl_scaff.gtf | wc -l
0
# the exact transcript ID with appended version ("ENST00000415118.1") is not found in the GTF file
$ grep "ENST00000415118" Homo_sapiens.GRCh38.112.chr_patch_hapl_scaff.gtf | head -n 1
14 havana transcript 22438547 22438554 . + . gene_id "ENSG00000223997"; gene_version "1"; transcript_id "ENST00000415118"; transcript_version "1"; gene_name "TRDD1"; gene_source "havana"; gene_biotype "TR_D_gene"; transcript_name "TRDD1-201"; transcript_source "havana"; transcript_biotype "TR_D_gene"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
# omitting the transcript version from the grep search returns GTF entries corresponding to ENST00000415118
# The transcript version is specified in a separate field, 'transcript_version'
To handle cases where a trancsript version is present in the FASTA but missing in the GTF transcript_id
field, the -t
flag can be passed to the r2d annotate
or r2d liftover
commands. This causes R2Dtool to consider the presence of transcript version information in the BED file.
Some trancriptome annotation, including GENCODE annotation, contain sequences that are duplicated between sex chromosomes, called pseudoautosomal regions (PARs). For example, searching for the transcript ENST00000507418 against gencode v39:
$ cat gencode.v39.chr_patch_hapl_scaff.annotation.gtf | grep "ENST00000507418.6" | awk '($3 == "transcript")'
chrX HAVANA transcript 156025664 156027877 . - . gene_id "ENSG00000227159.8"; transcript_id "ENST00000507418.6"; gene_type "unprocessed_pseudogene"; gene_name "DDX11L16"; transcript_type "unprocessed_pseudogene"; transcript_name "DDX11L16-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37115"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000022678.1"; havana_transcript "OTTHUMT00000058841.1";
chrY HAVANA transcript 57212184 57214397 . - . gene_id "ENSG00000227159.8_PAR_Y"; transcript_id "ENST00000507418.6_PAR_Y"; gene_type "unprocessed_pseudogene"; gene_name "DDX11L16"; transcript_type "unprocessed_pseudogene"; transcript_name "DDX11L16-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37115"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; tag "PAR"; havana_gene "OTTHUMG00000022678.1"; havana_transcript "OTTHUMT00000058841.1";
An identical transcript is found on chrX and chrY, however, the transcript on chrY ias the "PAR" tag. If an R2Dtool input bed file has positions on "ENST00000507418", R2Dtool will always liftover the sites to the non-PAR transcript, where a matching PAR-tagged trancript also exists.
To ensure that the GTF file only contains supported features, you can use the following AWK command to extract and check unique entries in column 3:
awk '!/^#/ {print $3}' "/path/to/annotation.gtf" | sort | uniq -c | sort -nr
This command filters entries based on expected feature types and lists unique feature types to verify against R2Dtool's requirements.
E.g. for Ensembl's GTF annotation of GRCh38:
(base) [as7425@gadi-login-07 ensembl_release_112]$ awk '!/^#/ {print $3}' Homo_sapiens.GRCh38.112.chr_patch_hapl_scaff.gtf | sort | uniq -c | sort -nr
1808998 exon
979613 CDS
278220 transcript
226422 three_prime_utr
190230 five_prime_utr
109019 start_codon
101732 stop_codon
70611 gene
131 Selenocysteine
Here, the exon, CDS, transcript and UTR features can be parsed. The codon
, gene
, and Selenocysteine
features will be gracefully ignored.
R2Dtool performs optimally with GTF files from Ensembl, as these typically follow strict formatting guidelines that align well with the tool’s requirements. Users are encouraged to use Ensembl GTFs to minimize compatibility issues and ensure the accuracy of gene annotations.
-
Validation: Validate the GTF file using tools like gffread to ensure adherence to format requirements, especially in terms of feature types and attributes.
-
Custom Annotations: If using custom annotations, verify that gene names in the GTF match those in the FASTA file headers.
-
Ensure the GTF files contain the feature types and attributes suggested above