Skip to content

R2Dtool GTF requirements

pre-mRNA edited this page Jun 25, 2024 · 1 revision

Selecting the right GTF file to use with R2Dtool

R2Dtool's annotation, liftover, and downstream plotting functions rely specifically on GTF version 2.2 annotations (also known as revised Ensembl GTFs). R2Dtool uses a custom Rust implementation for parsing GTF2.2 files into transcript and exon data structures, which are used to liftover and annotate RNA feature maps. Where matching annotations are available through Ensembl and an alternative database, we recommend the use of Ensembl annotations, available from the Ensembl FTP site.

Essential GTF feature types and attributes:

GTF features are specified in column 3. For complete functionality, R2Dtool requires on the following feature types being present in the GTF file:

  • transcript or mRNA
  • exon
  • CDS
  • five_prime_utr or 5UTR
  • three_prime_utr or 3UTR

While exon, transcript, and UTR features are mandatory for liftover and annotation purposes, R2Dtool can produce incomplete outputs if CDS features are missing. This will result in the introduction of NA outputs for annotation fields where the source data is missing in the GTF file.

GTF attributes are specified in column 9. For complete functionality, R2Dtool requires the following features to be specified for each transcript of interest:

  • transcript_id
  • gene_id
  • gene_name
  • transcript_biotype*

Note: Transcript biotype can be specified differently depending on GTF source. R2Dtool can accept biotype in one of transcript_biotype, transcript_type, or if biotype for a specific transcript is unknown, it is assumed from gene_type, or gene_biotype.

transcript_id is the only attribute required for liftover, but all attributes specified above are necessary for annotation and downstream metaplots.

Linking gene models to transcript isoforms, and handling transcript version:

The transcript_id specified in the GTF attributes field must exactly match the reference sequence specified in column 1 of the transcriptome-specified input BED file. It is advisable to use genomes FASTA, transcriptome FASTA, and gene annotation GTF files from the same genome release, or to generate your transcriptome FASTA file, using a matched genome FASTA and GTF file, e.g. using gffread.

In some cases, the FASTA transcript IDs are appended with a transcript version, whereas the GTF transcript_id field doesn't directly contain this information. For example, in the Ensembl 112 GRCh38 cDNA and GTF files:

$ head -n 2 Homo_sapiens.GRCh38.cdna.all.fa 

>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
# the FASTA file contains a transcript version ".1" appended to the transcript ID "ENST00000415118"

$ grep "ENST00000415118.1" Homo_sapiens.GRCh38.112.chr_patch_hapl_scaff.gtf | wc -l 
0 
# the exact transcript ID with appended version ("ENST00000415118.1") is not found in the GTF file

$ grep "ENST00000415118" Homo_sapiens.GRCh38.112.chr_patch_hapl_scaff.gtf | head -n 1 
14      havana  transcript      22438547        22438554        .       +       .       gene_id "ENSG00000223997"; gene_version "1"; transcript_id "ENST00000415118"; transcript_version "1"; gene_name "TRDD1"; gene_source "havana"; gene_biotype "TR_D_gene"; transcript_name "TRDD1-201"; transcript_source "havana"; transcript_biotype "TR_D_gene"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";
# omitting the transcript version from the grep search returns GTF entries corresponding to ENST00000415118
# The transcript version is specified in a separate field, 'transcript_version'

To handle cases where a trancsript version is present in the FASTA but missing in the GTF transcript_id field, the -t flag can be passed to the r2d annotate or r2d liftover commands. This causes R2Dtool to consider the presence of transcript version information in the BED file.

Handling of Duplicate Transcripts:

Some trancriptome annotation, including GENCODE annotation, contain sequences that are duplicated between sex chromosomes, called pseudoautosomal regions (PARs). For example, searching for the transcript ENST00000507418 against gencode v39:

$ cat gencode.v39.chr_patch_hapl_scaff.annotation.gtf | grep "ENST00000507418.6" | awk '($3 == "transcript")' 
chrX    HAVANA  transcript      156025664       156027877       .       -       .       gene_id "ENSG00000227159.8"; transcript_id "ENST00000507418.6"; gene_type "unprocessed_pseudogene"; gene_name "DDX11L16"; transcript_type "unprocessed_pseudogene"; transcript_name "DDX11L16-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37115"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTHUMG00000022678.1"; havana_transcript "OTTHUMT00000058841.1";
chrY    HAVANA  transcript      57212184        57214397        .       -       .       gene_id "ENSG00000227159.8_PAR_Y"; transcript_id "ENST00000507418.6_PAR_Y"; gene_type "unprocessed_pseudogene"; gene_name "DDX11L16"; transcript_type "unprocessed_pseudogene"; transcript_name "DDX11L16-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37115"; ont "PGO:0000005"; tag "basic"; tag "Ensembl_canonical"; tag "PAR"; havana_gene "OTTHUMG00000022678.1"; havana_transcript "OTTHUMT00000058841.1";

An identical transcript is found on chrX and chrY, however, the transcript on chrY ias the "PAR" tag. If an R2Dtool input bed file has positions on "ENST00000507418", R2Dtool will always liftover the sites to the non-PAR transcript, where a matching PAR-tagged trancript also exists.

Checking your GTF file:

To ensure that the GTF file only contains supported features, you can use the following AWK command to extract and check unique entries in column 3:

awk '!/^#/ {print $3}' "/path/to/annotation.gtf" | sort | uniq -c | sort -nr 

This command filters entries based on expected feature types and lists unique feature types to verify against R2Dtool's requirements.

E.g. for Ensembl's GTF annotation of GRCh38:

(base) [as7425@gadi-login-07 ensembl_release_112]$ awk '!/^#/ {print $3}' Homo_sapiens.GRCh38.112.chr_patch_hapl_scaff.gtf | sort | uniq -c | sort -nr 
1808998 exon
 979613 CDS
 278220 transcript
 226422 three_prime_utr
 190230 five_prime_utr
 109019 start_codon
 101732 stop_codon
  70611 gene
    131 Selenocysteine

Here, the exon, CDS, transcript and UTR features can be parsed. The codon, gene, and Selenocysteine features will be gracefully ignored.

Compatibility with Ensembl GTFs

R2Dtool performs optimally with GTF files from Ensembl, as these typically follow strict formatting guidelines that align well with the tool’s requirements. Users are encouraged to use Ensembl GTFs to minimize compatibility issues and ensure the accuracy of gene annotations.

Tips for preparing custom GTF files and transcriptomes

  • Validation: Validate the GTF file using tools like gffread to ensure adherence to format requirements, especially in terms of feature types and attributes.

  • Custom Annotations: If using custom annotations, verify that gene names in the GTF match those in the FASTA file headers.

  • Ensure the GTF files contain the feature types and attributes suggested above