GAP

Genome Annotation Paser (GAP) is a tool for parsing Gencode/Ensembl GTF and NCBI GFF3 genome annotation files.

Installation

Prerequisites

Python 2.7 or Python 3

Installing

Add these lines to your ~/.bash_profile, [GAP] is the absolute path of GAP.

export PYTHONPATH=[GAP]:$PYTHONPATH
export PATH=[GAP]:$PATH

Main functions

Fetch transcritome file from genome with GFF/GTF annotation;
Conversion between three coordinate systems: Gene/Transcript/Genome;
Get transcription information (chrID, gene type, length, CDS region...) given a transcription ID or gene name;
Show mRNA structure
...

Download data

Gencode GTF/Genome: https://www.gencodegenes.org
Enesembl GTF/Genome: https://ensembl.org/info/data/ftp/index.html
NCBI GFF3/Genome: ftp://ftp.ncbi.nlm.nih.gov/genomes

Quick start

1. Convert a GTF file to *.genomeCoor.bed file

parseGTF.py -g chicken.gtf -o chicken -s ensembl --genome chicken_ensembl.fa

The genome file --genome chicken_ensembl.fa is optional.

It will produce three files:

chicken.genomeCoor.bed -- A simple version of the genome-based annotation file
chicken.transCoor.bed -- A simple version of the transcript-based annotation file
chicken_transcriptome.fa -- Transcriptome file

2. Read the annotation

import GAP
chicken_parser = GAP.init("chicken.genomeCoor.bed", "chicken_transcriptome.fa")

Another way is to read the GTF file directly.

chicken_parser = GAP.initGTF("chicken.gtf", genomeFile="chicken_ensembl.fa", source='Ensembl')

3. Get all mRNAs

mRNAs = chicken_parser.getmRNATransList(); print len(mRNAs)
# 30252
print mRNAs[:5]
# ['ENSGALT00000005443', 'ENSGALT00000005447', 'ENSGALT00000013873', 'ENSGALT00000001353', 'ENSGALT00000001352']

It shows that chicken has 30252 mRNA transcripts.

4. Get GAPDH gene id and transcripts

GAPDH_gID = chicken_parser.getGeneByGeneName("GAPDH"); print GAPDH_gID
# ENSGALG00000014442

GAPDH_transcripts = chicken_parser.getTransByGeneID(GAPDH_gID); print GAPDH_transcripts
# ['ENSGALT00000046744', 'ENSGALT00000086833', 'ENSGALT00000023323', 'ENSGALT00000086032', 'ENSGALT00000074237', 'ENSGALT00000090208', 'ENSGALT00000051222', 'ENSGALT00000054080', 'ENSGALT00000089752', 'ENSGALT00000085687']

## Print all transcript gene type and length
for tid in GAPDH_transcripts:
	trans_feature = chicken_parser.getTransFeature(tid)
	print tid, trans_feature['gene_type'], trans_feature['trans_len']

# ENSGALT00000046744 protein_coding 1076
# ENSGALT00000086833 protein_coding 1122
# ENSGALT00000023323 protein_coding 1288
# ENSGALT00000086032 protein_coding 1302
# ENSGALT00000074237 protein_coding 1091
# ENSGALT00000090208 protein_coding 670
# ENSGALT00000051222 protein_coding 1179
# ENSGALT00000054080 protein_coding 1498
# ENSGALT00000089752 protein_coding 434
# ENSGALT00000085687 protein_coding 991

5. Get UTR and CDS region of longest transcript of GAPDH

## Method 1
ft = chicken_parser.getTransFeature("ENSGALT00000054080")
cds_start, cds_end = ft['cds_start'], ft['cds_end']
GAPDH = chicken_parser.getTransSeq("ENSGALT00000054080")

UTR_5 = GAPDH[:cds_start-1]
CDS = GAPDH[cds_start-1:cds_end]
UTR_3 = GAPDH[cds_end:]

## Method 2
print chicken_parser.showRNAStructure("ENSGALT00000054080")

6. Get genome coordination of GAPDH start codon

chrID, chrPos, strand = chicken_parser.transCoor2genomeCoor("ENSGALT00000054080", cds_start)
print chrID, chrPos, strand
# ['1', 76953317, '+']

It shows that GAPDH start codon is located at 76953317 of positive strand of chromosome 1

7. Find and label all GGAC motifs in GAPDH

## Get Sequence
seq = chicken_parser.getTransSeq("ENSGALT00000054080")

## Collect motif sites
locs = []
start = 0
while 1:
    start = seq.find("GGAC", start+1)
    if start == -1: break
    locs.append(start)

## Label
for loc in locs:
    print str(loc)+"\t"+chicken_parser.labelRNAPosition("ENSGALT00000054080", [loc,loc])

Subjects

Init a GAP object

There are two ways to init a GAP object:

Convert the GTF or GFF3 file to a *.genomeCoor.bed file and read it;

Read the GTF or GFF3 directly.

Method 1 -- Parse GTF/GFF3 file

parseGTF.py -g chicken.gtf -o chicken -s ensembl --genome chicken_ensembl.fa

The --genome is optional.

import GAP
chicken_parser = GAP.init("chicken.genomeCoor.bed", "chicken_transcriptome.fa")

Method 2 -- Read GTF/GFF3 directly

import GAP
chicken_parser = GAP.initGTF("chicken.gtf", genomeFile="chicken_ensembl.fa", source='Ensembl')

parseGTF.py
combineNCBIGenome.py
GAP.init
GAP.initGTF
addSeq

Get transcript or gene information

getTransFeature
getGeneParser
getGeneIntron
getGeneExon
getGeneCombinedIntronExon

Coordinate transformation

geneCoor2genomeCoor
genomeCoor2geneCoor
genomeCoor2transCoor
transCoor2genomeCoor
transCoor2geneCoor
geneCoor2transCoor

Get transcript list or gene list

getTransList
getGeneList
getmRNATransList
getmRNAGeneList
getLenSortedTransDictForGenes
getTransByGeneID
getGeneByGeneName
getTransByGeneName
getTransByGeneType

Get transcript sequence and mRNA structure

getTransSeq
showRNAStructure
labelRNAPosition
getRNAPosition

Excutable files

parseGTF.py

Features

Parse GTF/GFF3 file and produce *.genomeCoor.bed, *.transCoor.bed and *_transcriptome.fa files

Parameters

-g         genome annotation, Gencode/Ensembl GTF file or NCBI GFF3
-o         output file prefix
-s         [ensembl|gencode|ncbi] data source

optional:
--genome   fetch transcriptome from genome file, produce a prefix_transcriptome.fa file
--noscaffold	Remove scaffolds. Scaffolds are defined as those chromosomes with id length > 6 and not startswith chr and NC_
--rawchr	Use raw chromosome ID, don't convert NC_* to chrXX. Only useful for NCBI source GFF3

Output

*.genomeCoor.bed, *.transCoor.bed and *_transcriptome.fa

combineNCBIGenome.py

Features

Combine single chromosomes download from NCBI. Convert the NC_* code to chr* code
NC_006088.5 => chr1

Parameters

./combineNCBIGenome.py chr1.fasta chr2.fasta chr3.fasta... > outFile.fa

Return

A combined genome file with chr* code

Functions

GAP.init(genomeCoorBedFile, seqFn="", showAttr=True, rem_tVersion=False, rem_gVersion=False)

Features

Init a GAP object from *.genomeCoor.bed file

Parameters

genomeCoorBedFile   -- A *.genomeCoor.bed file produced by parseGTF.py
seqFn               -- Transcriptome fasta file produced by parseGTF.py
showAttr            -- Show an example
rem_tVersion        -- Remove version information. ENST000000022311.2 => ENST000000022311
rem_gVersion        -- Remove version information. ENSG000000022311.2 => ENSG000000022311

Return

GAP object

initGTF(AnnotationGTF, source, genomeFile="", showAttr=True, rem_tVersion=False, rem_gVersion=False, verbose=False)

Features

Init a GAP object from GTF/GFF3 file

Parameters

AnnotationGTF       -- Ensembl/Gencode GTF file or NCBI GFF3 file
genomeFile          -- Genome file
source              -- Gencode/Ensembl/NCBI
showAttr            -- Show an example
rem_tVersion        -- Remove version information. ENST000000022311.2 => ENST000000022311
rem_gVersion        -- Remove version information. ENSG000000022311.2 => ENSG000000022311
verbose             -- Show process information

Return

GAP object

addSeq(seqFileName, remove_tid_version=False)

Features

If the sequence is not provided when init a GAP object, add the sequence to it.

Parameters

seqFileName         -- Transcriptome fasta file produced by parseGTF.py
remove_tid_version  -- Remove version information. ENST000000022311.2 => ENST000000022311

Return

No

getTransFeature(transID, showAttr=False, verbose=True)

Features

Get transcript features given the transcript id

Parameters

transID             -- Transcript ID
showAttr            -- Show an example
verbose             -- Print the warning information when transcript not found

Return

Return a dictionary:
    chr             -- Genome chromosome
    strand          -- + or -
    start           -- Genome start
    end             -- Genome end
    
    gene_name       -- Gene symbol
    gene_id         -- Gene id
    gene_type       -- Gene type
    
    trans_len       -- Transcript length
    
    utr_5_start     -- Transcript-based start site of 5'UTR
    utr_5_end       -- Transcript-based end site of 5'UTR
    cds_start       -- Transcript-based start site of CDS
    cds_end         -- Transcript-based end site of CDS
    utr_3_start     -- Transcript-based start site of 3'UTR
    utr_3_end       -- Transcript-based end site of 3'UTR
    
    exon_str        -- Genome-based exon string

getGeneParser(showAttr=True)

Features

Get gene parser with gene informations

Parameters

showAttr            -- Show an example

Return

Return a list of dictionaries:
    { geneID => { ... }, ... } includes
        chr         -- Genome chromosome
        strand      -- + or -
        start       -- Genome start
        end         -- Genome end
        
        gene_name   -- Gene symbol
        gene_type   -- All transcript types
        
        length      -- Gene length (end-start+1)
        
        transcript  -- All transcripts belong to this gene

getGeneIntron(geneID)

Features

Get genome-based intron corrdinates of given gene

Parameters

geneID              -- Gene id

Return

Get introns of all transcripts of this gene:
            { transID => [[intron1_start, intron1_end], [intron2_start, intron2_end], [intron3_start, intron3_end]],... }

getGeneExon(geneID)

Features

Get genome-based exon corrdinates of given gene

Parameters

geneID              -- Gene id

Return

Get exons of all transcripts of this gene:
            { transID => [[exon1_start, exon1_end], [exon2_start, exon2_end], [exon3_start, exon3_end]...],... }

geneCoor2genomeCoor(geneID, pos)

Features

Convert gene-based coordinates to genome-based coordinates

Parameters

geneID              -- Gene id
pos                 -- Gene-based position

Return

Return: [chrID, chrPos, Strand]

genomeCoor2geneCoor(chrID, start, end, strand)

Features

Convert genome-based coordinates to gene-based coordinates

Parameters

chrID               -- Chromosome id
start               -- Chromosome-based start position
end                 -- Chromosome-based end position
strand              -- Chromosome strand

Return

Return [ [chrID, chrStart, chrEnd, geneID, geneStart, geneEnd], ... ]

genomeCoor2transCoor(chrID, start, end, strand)

Features

Convert genome-based coordinates to transcript-based coordinates

Parameters

chrID               -- Chromosome id
start               -- Chromosome-based start position
end                 -- Chromosome-based end position
strand              -- Chromosome strand

Return

Return [ [chrID, chrStart, chrEnd, transID, transStart, transEnd], ... ]

transCoor2genomeCoor(transID, pos)

Features

Convert transcript-based coordinates to genome-based coordinates

Parameters

transID             -- Transcript id
pos                 -- Transcript-based position

Return

Return [chrID, chrPos, Strand]

geneCoor2transCoor(geneID, start, end)

Features

Convert gene-based coordinates to transcript-based coordinates

Parameters

geneID              -- Gene id
start               -- Gene-based start position
end                 -- Gene-based end position

Return

Return  [chrID, chrStart, chrEnd, transID, transStart, transEnd], ... ]

transCoor2geneCoor(transID, start, end)

Features

Convert transcript-based coordinates to gene-based coordinates

Parameters

transID             -- Transcript id
start               -- Transcript-based start position
end                 -- Transcript-based end position

Return

return [ [chrID, chrStart, chrEnd, geneID, geneStart, geneEnd], ... ]

getTransList()

Features

Return list of all transcript id

Parameters

No

Return

return [ transID1, transID2, ... ]

getGeneList()

Features

Return list of all gene id

Parameters

No

Return

return [ geneID1, geneID2, ... ]

getmRNATransList()

Features

Return list of all mRNA id

Parameters

No

Return

return [ mRNAID1, mRNAID2, ... ]

getmRNAGeneList()

Features

Return list of all mRNA gene id

Parameters

No

Return

return [ mRNAGeneID1, mRNAGeneID2, ... ]

getLenSortedTransDictForGenes(only=[])

Features

Return a dictionary of geneID to sorted transID list

Parameters

only                --  Contraint transcript types, such as mRNA, snRNA ...

Return

return { geneID => [ transID1, transID2, ... ], ...}
transcripts are sorted by length

getTransByGeneID(geneID)

Features

Return transcripts belong to specific gene

Parameters

geneID              -- Gene id

Return

return [ transID1, transID2... ]

getGeneByGeneName(geneName)

Features

Return gene id with specific gene name

Parameters

geneName            -- Gene symbol

Return

return geneID

getTransByGeneName(geneName)

Features

Return transcripts belong to specific gene

Parameters

geneName            -- Gene symbol

Return

return [ transID1, transID2... ]

getTransByGeneType(geneType)

Features

 Return a list of transcripts belong to specific gene type

Parameters

geneType            -- Gene type

Return

return [ transID1, transID2... ]

getTransSeq(transID)

Features

 Return transcript sequence

Parameters

transID             -- Transcript id

Return

return sequence

getGeneCombinedIntronExon(geneID, verbose=True)

Features

Parse gene intron/exon regions:
    1. Combine exons from all transcripts, they are defined as exon regions
    2. Remove the exon regions from gene regions, they are defined as intron regions

Parameters

geneID          -- Gene id
verbose         -- Print error information when error occured

Return

return: [intron_regions, exon_regions]

showRNAStructure(transID)

Features

Return string of mRNA sequence with color labeled its UTR/CDS/Codons

Parameters

transID         -- Transcript id

Return

return: colored sequence

labelRNAPosition(transID, region, bn=None, bw=None)

Features

Return string of mRNA sequence with color labeled its UTR/CDS/Codons

        return RNA structure and label the region
        -------|||||||||||||||||--------------------------
         5'UTR        CDS                3'UTR

Parameters

transID         -- Transcript id
region          -- [start, end]
bn              -- Bin number (default: 50)
bw              -- Bin width
bw and bn cannot be specified at the same time

Return

return: -------|||||||||||||||||--------------------------

getRNAPosition(transID, region)

Features

Get the position of region located in mRNA

Parameters

transID         -- Transcript id
region          -- [start, end]

Return

Return any one of:
    -> not_mRNA
    -> 5UTR
    -> span_5UTR_CDS
    -> CDS
    -> span_CDS_3UTR
    -> 3UTR
    -> span_5UTR_CDS_3UTR
    -> INVALID

Notice

Complete documentation is in doc/Introduction.html

Authors

Li Pan - Programmer - Zhanglab

Files

README.md

Latest commit

History

README.md

File metadata and controls

GAP

Installation

Prerequisites

Installing

Main functions

Download data

Quick start

1. Convert a GTF file to *.genomeCoor.bed file

2. Read the annotation

3. Get all mRNAs

4. Get GAPDH gene id and transcripts

5. Get UTR and CDS region of longest transcript of GAPDH

6. Get genome coordination of GAPDH start codon

7. Find and label all GGAC motifs in GAPDH

Subjects

Init a GAP object

Method 1 -- Parse GTF/GFF3 file

Method 2 -- Read GTF/GFF3 directly

Get transcript or gene information

Coordinate transformation

Get transcript list or gene list

Get transcript sequence and mRNA structure

Excutable files

parseGTF.py

Features

Parameters

Output

combineNCBIGenome.py

Features

Parameters

Return

Functions

GAP.init(genomeCoorBedFile, seqFn="", showAttr=True, rem_tVersion=False, rem_gVersion=False)

Features

Parameters

Return

initGTF(AnnotationGTF, source, genomeFile="", showAttr=True, rem_tVersion=False, rem_gVersion=False, verbose=False)

Features

Parameters

Return

addSeq(seqFileName, remove_tid_version=False)

Features

Parameters

Return

getTransFeature(transID, showAttr=False, verbose=True)

Features

Parameters

Return

getGeneParser(showAttr=True)

Features

Parameters

Return

getGeneIntron(geneID)

Features

Parameters

Return

getGeneExon(geneID)

Features

Parameters

Return

geneCoor2genomeCoor(geneID, pos)

Features

Parameters

Return

genomeCoor2geneCoor(chrID, start, end, strand)

Features

Parameters

Return

genomeCoor2transCoor(chrID, start, end, strand)

Features

Parameters

Return

transCoor2genomeCoor(transID, pos)

Features