Source Code Documentation

INDEX

Class Name	Description	Code
account.c	Description	Code
BackUpGenes.c	Description	Code
beggar.c	Description	Code
BuildAcceptor.c	Description	Code
BuildInitialExons.c	Description	Code
BuildInternalExons.c	Description	Code
BuildORFs.c	Description	Code
BuildSingles.c	Description	Code
BuildSort.c	Description	Code
BuildTerminalExons.c	Description	Code
CookingGenes.c	Description	Code
ComputeStopInfo.c	Description	Code
CorrectExon.c	Description	Code
Dictionary.c	Description	Code
DumpHash.c	Description	Code
FetchSequence.c	Description	Code
genamic.c	Description	Code
geneidc.c	Description	Code
GetSitesWithProfile.c	Description	Code
GetStopCodons.c	Description	Code
manager.c	Description	Code
output.c	Description	Code
PrintExons.c	Description	Code
PrintSites.c	Description	Code
readargv.c	Description	Code
ReadExonsGFF.c	Description	Code
ReadGeneModel.c	Description	Code
ReadParam.c	Description	Code
ReadSequence.c	Description	Code
ReadHSP.c	Description	Code
RecomputePositions.c	Description	Code
RequestMemory.c	Description	Code
ScoreExons.c	Description	Code
SearchEvidenceExons.c	Description	Code
Setratios.c	Description	Code
SortExons.c	Description	Code
SortHSPs.c	Description	Code
SwitchFrames.c	Description	Code
SwitchPositions.c	Description	Code
Translate.c	Description	Code
geneidh.c	Description	Code

account

Description:
To save accounting information such as statistics about amount of predicted sites and exons or (CPU and real) time consumed for the current running. This data will be displayed at the end of the processing (verbose mode).
Briefing:
account* InitAcc()
Allocation of accounting data structure and setting the counters (including the time counter).
void updateTotals(account m, packSites allSites, packSites* allSites_r, packExons* allExons, packExons* allExons_r)
Update stats for every type of sites and exons. Compute the total number of sites and exons (both strands and together) in the input sequence.
void cleanAcc(account* m)
Reset total number of sites and exons counters for the next input sequence.

Back to top

BackupGenes

Description:
Data structures about predicted signals and exons are emptied and fill in in every fragment processing, but some exons (and genes) are needed to be used in the next split and therefore they must be temporally in the arrays of sites and exons into the dumpster. To avoid copying twice or more times the same exon, a hash table (DumpHash) is available to record which exons have been already copied and their address. Gene structure (linked exons) must be preserved between copies of genes.
Briefing:
long IncrMod(long x, long Modulus)
Increase values modulus an input long number.
exonGFF* backupExon(exonGFF* E, exonGFF* Prev, packDump* d)
It saves an exon (properties and sites) into the dumpster. Returns the pointer to the new copy for this exon.
exonGFF* backupGene(exonGFF* E, packDump* d)
It saves the gene (exons, sites, properties) which last exon is E, into the dumpster. For every exon of this gene, check whether it has already been copied or not, by looking up the hash table. Note: Whenever any exon of a gene is found to be in the dumpster, it is not necessary to go on copying the rest of the gene, because, from this point, both genes will be exactly equal (dynamic programming). This is because there is only one best gene finished in this exon, whatever exons have forward. Obviously, this function is recursive due to the need to conserve exon links (to backup one exon, its previous exon must have already been copied, and therefore the address is known).
void BackupGenes(packGenes* pg, int nclass, packDump* d)
To save best partial assembled genes and temporary optimal gene from packGenes structure to be used the next fragment of the input sequence.
void BackupArrayD(packGenes* pg, long accSearch, gparam* gp, packDump* dumpster)
From the set of exons input to genamic (gene assembling algorithm), some ordering functions by donor (right position) are computed for every class or assembling rule in the gene model. Most exons are used in the same iteration where they were produced, but a few of them in the intersection between two fragments of sequence, must be saved to be able to recover the gene assembling process in the next iteration, checking minimum/maximum distance requirements as well.
void cleanGenes(packGenes* pg, int nclass, packDump* dumpster)
Preparing sort-by-donor functions and best partial genes structures for the next DNA sequence (reset counters and pointers).

Back to top

beggar

Description:
Statistics about memory requirements. This option displays the information and stops geneid running.
Briefing:
void beggar(long L)
Estimate the memory needed to run geneid (current configuration)

Back to top

BuildAcceptor

Description:
Statistics about memory requirements. This option displays the information and stops geneid running.
Briefing:
float ComputeExtraProfile(char* s, long positionAcc, long limitRight, profile* p)
Given a predicted acceptor site, search the best BPoint or PPTract juest before the AG signal.
Briefing:
long BuildAcceptors(char* s, profile* p, profile* ppt, profile* bp, site* st, long l1, long l2)
Prediction of acceptor sites, taking into account the associated signals (BPoint, PPTract) to the characteristic AG core.

Back to top

BuildInitialExons

Description:
Initial (first) exon construction: first exons are DNA regions beginning in a start codon (ATG) and finishing in a donor site (GT), not including any stop codon in frame with the translation start.
Briefing:
long BuildInitialExons(site Start, long nStarts, site Donor, long nDonors, site Stop, long nStops, int MaxDonors, char Sequence, exonGFF *Exon)
Every list of signals is sorted by position due to the signal prediction process. For every Start codon, donors between the Start and the first Stop in frame are candidates to be First exons. However, there is a limited maximum amount of exons to build, (parameter MaxDonors), being allocated during the processing in a local array. If there is no room for more exons beginning by this Start, the worst donor will be rejected. The output list of exons is sorted by Start position. According to the nucleotides at the end of the exon, some information is computed to detect possible Stop codons when assembling with other exons.

Back to top

BuildInternalExons

Description:
Internal exon construction: internal exons are DNA regions beginning in a acceptor splice site (AG) and finishing in a donor site (GT), not including any stop codon in frame. Internal exons may have the first and the last codon uncomplete (frame and remainder). Thus, to get the true frame, the length of the uncomplete codon must be added to the left position of the exon so that stops in frame with this new frame can be searched. Two exons (E,F) are allowed to be assembled (E-F) only if remainder(E) + frame(F) = 3, to build a complete codon.
Briefing:
long BuildInternalExons(site Acceptor, long nAcceptors, site Donor, long nDonors, site Stop, long nStops, int MaxDonors, char Sequence, exonGFF* Exon)
Every list of signals is sorted by position due to the signal prediction process. For every couple (Acceptor, Donor), three exons might be built according to the number of frames closed by the Stops following the current Acceptor. There is a limited maximum amount of exons to build, (parameter MaxDonors per frame), being allocated during the processing in a local array. Moreover, a limited minimum length required is provided (EXONLENGTH). If there is no room for more exons beginning by this Acceptor, the worst donor will be rejected. The output list of exons is sorted by Acceptor position. According to the nucleotides at the end of the exon, some information is computed to detect possible Stop codons when assembling with other exons.

Back to top

BuildORFs

Description:
Open Reading Frames (ORFs) construction: ORFs are DNA regions beginning in one start codon (ATG), being as long as it is possible, finishing with a stop codon, and not including any other previous stop codon in frame.
Briefing:
long BuildORFs(site Start, long nStarts, site Stop, long nStops, long cutPoint, exonGFF *Exon)
Every list of signals is sorted by position due to the signal prediction process. For every start codon, looking for the first stop codon in frame to build the ORF. The CutPoint value is necessary to keep the order between the set of exons predicted on one fragment and the set predicted on the next one. ORFLENGTH value checks the minimal length of a single gene. The output list of exons is sorted by Start codon position.

Back to top

BuildSingles

Description:
Single genes construction: single gene exons are DNA regions beginning in one start codon (ATG) and finishing in a stop codon in frame (TGA/TAG/TAA), not including any other previous stop codon in frame.
Briefing:
long BuildSingles(site Start, long nStarts, site Stop, long nStops, long cutPoint, exonGFF *Exon)
Every list of signals is sorted by position due to the signal prediction process. For every start codon, looking for the first stop codon in frame to build the single gene. The CutPoint value is necessary to keep the order between the set of exons predicted on one fragment and the set predicted on the next one. SINGLEGENELENGTH value checks the minimal length of a single gene. The output list of exons is sorted by Start codon position.

Back to top

BuildSort

Description:
Given one exon, the main goal is to find the best gene finished in that exon (dynamic programming), looking for every exon which may be placed before it, according to the assembling rules expressed in the gene model, and selecting the best one (higher gene score). Given a rule or gene class, the set of exons that may be used at the left part of the rule are stored, sorted by donor (right) position, in the d-array structure (plus extra array of counters). This module sorts by donor the input exons according to every class rules and the type of exons. As the set of input exons were initially ordered by acceptor, donor re-arrangements are very similar and preserves quite well the original order.
Briefing:
void BuildSort(dict* D, int nc[], int ne[], int UC[][MAXENTRY], int DE[][MAXENTRY], int nclass, long km[], exonGFF* *d, exonGFF E, long nexons)
Essentially, for every exon and the set of available rules or classes (in which its type is at the left part): to insert into the sorting function by running the sorting algorithm by insertion. (Note: As input exons are sorted by acceptor and this order is partially maintained in the arrangement by donor, there will be many simple insertion operations and few operations of shifting in the d-arrays (linear time cost)).

Back to top

BuildTerminalExons

Description:
Terminal exon construction: terminal exons are DNA regions beginning in one acceptor site (AG) and finishing in a stop codon in frame (TGA/TAG/TAA), not including any other previous stop codon in frame.
Briefing:
long BuildTerminalExons (site Acceptor, long nAcceptors, site Stop, long nStops, long LengthSequence, exonGFF* Exon, long cutPoint)
Every list of signals is sorted by position due to the signal prediction process. For every acceptor codon there are three possible exons to build depending on the corresponding stop in frame exists. The CutPoint value is necessary to keep the order between the set of exons predicted on one fragment and the set predicted on the next one. The output list of exons is sorted by Acceptor site position.

Back to top

CookingGenes

Description:
Terminal exon construction: terminal exons are DNA regions beginning in one acceptor site (AG) and finishing in a stop codon in frame (TGA/TAG/TAA), not including any other previous stop codon in frame.
Briefing:
long BuildTerminalExons (site Acceptor, long nAcceptors, site Stop, long nStops, long LengthSequence, exonGFF* Exon, long cutPoint)
Every list of signals is sorted by position due to the signal prediction process. For every acceptor codon there are three possible exons to build depending on the corresponding stop in frame exists. The CutPoint value is necessary to keep the order between the set of exons predicted on one fragment and the set predicted on the next one. The output list of exons is sorted by Acceptor site position.

Back to top

ComputeStopInfo

Description:
Genes are series of connected exons. It sometimes happens that one stop codon is made up by the final nucleotides of the left exon and the last nucleotides of the right one. To avoid this situation, geneid records these features for every exon. Then, during the assembly, this requirement is evaluated.
Briefing:
void ComputeStopInfo(exonGFF* e, char* s)
Record which nucleotides were at the end of the current exon according to the content of the input sequence. Depending on the `lValue` (beginning) and `rValue` (end), during the assembly, a connection will be allowed or not.

Back to top

CorrectExon

Description:
Fixing the coordinates for predicted exons. Arrays and most processing in language C start in 0 to L-1, while positions of predictions in the real sequence ranges from 1 to L. Moreover, stop codons are included into the Terminal exons, Single Genes and ORFs.
Briefing:
void CorrectExon(exonGFF* e)
Increase/decrease positions (offset1, offset2) due to the C offset in the arrays. Include stop codon in Terminal exons and Single genes.
void CorrectORF(exonGFF* e)
Increase/decrease positions (offset1, offset2) due to the C offset in the arrays. Stop codon included.

Back to top

Dictionary

Description:
This module implements a look-up table or dictionary to simulate associative arrays by using a hash table that binds unique integer keys to gene features such as "First" or "Internal" (combined with the strand "+" or "-"). There is enough room for MAXENTRY different features.
Briefing:
void resetDict(dict* d)
Reset the counter nextFree (assign new key) and NULL pointers in the hash table of sinonimous.
int f(char s[])
Hash function: weighted sum of chars of the input string modulus the length of the hash table to get an integer between 0..MAXENTRY-1.
int setkeyDict(dict* d, char s[])
Assign a new key for the input string if it was not into the hash table. Management of computing hash function and colissions.
int getkeyDict(dict* d, char s[])
Finding a word into the dictionary by computing the hash function and looking up the proper list of nodes in the hash table. NOTFOUND is returned if the word is not found, but the key is returned if it exists.
void showDict(dict* d)
Display the nodes (string,key) stored into the dictionary.
void freeNodes(pnode node)
Set free a list of sinonimous (colissions).
void freeDict(dict* d)
Set recursively free the dictionary hash table of sinonimous.
void setAADict(dict* d, char s[], char aA)
Store the Genetic Code (1 codon : 1 amino acid) in a dictionary.
char getAADict(dict* d, char s[])
Returns the translation of the input codon by using the Genetic Code.

Back to top

DumpHash

Description:
Between one fragment of DNA input sequence and the next one to be processed, some useful information must be saved to go the prediction on: best partial genes and exons still not used in the gene assembling. But between best partial genes, there is a lot of redundancy (the same or similar set of exons). Every exon must be saved only once, not more, so that by using the dumspter (hash table of backup exons), is very easy to know whether one exon has already been copied or not. Exon features as signal positions, exon type or strand are used to bind unique keys to these exons.
Briefing:
void resetDumpHash(dumpHash* h)
Initialize the hash table and set the counter of exons copied.
long fDump(exonGFF* E)
Hash function: integer computed from exon features.
void setExonDumpHash(exonGFF* E, dumpHash* h)
Insert the input exon into the dumpster hash table after have been copied.
exonGFF* getExonDumpHash(exonGFF* E, dumpHash* h)
Finding an exon into the hash table to know it must be whether copied or not. If it has already been copied before, the address of the copy is returned.
void freeDumpNodes(dumpNode* node)
Free recursively a list of sinonimous nodes.
void cleanDumpHash(dumpHash *h)
Free all of lists of sinonimous nodes in the hash table.

Back to top

FetchSequence

Description:
Prepare the input sequence (changing lower to upper cases, if needed), producing the reverse sense sequence which will be processed in parallel with the original sequence.
Briefing:
int complement(int c)
Return the complementary nucleotide to c.
long FetchSequence(char s, char r)
Running from both ends in the input sequence, reversing and complementing the nucleotides to produce the reverse sense sequence. Return the length of both sequences. Output sequence has been previously allocated.
void ReverseSubSequence(long p1, long p2, char* s, char* r)
Reverse and complement a fragment of DNA from p1 to p2 in the input sequence. Output sequence has been previously allocated.

Back to top

genamic

Description:
Assembling the best gene from an input set of exons by using the algorithm genamic which is based on dynammic programming techniques. The best gene is the series of linked exons which have the highest sum of scores respecting the allowed rules (gene model). Optimal result is guaranteed. Exons must be sorted by acceptor (left, minor) position and within genamic, for every assembling rule (class), exons are also sorted by donor (right, major) position. The goal is to assemble the best gene finished with every exon and then, return the highest score produced gene. Essentially, for every exon in the input, there are 3 possible assemblings: remain alone, join to the gene assembled to the previous input exon or join to the best gene that finishes between the previous input exon and it. Having acceptor and donor sorting functions, every exon (and the associated gene finishing with it) is used only once, and therefore, genamic is a linear time algorithm (respect to the number of input exons). Annotations or evidences may be included among the predicted exons. Group field is then used to allow the mix between ab initio and evidence exons or to force that only exons having same group can be joined using a blocked rule. To block a rule, use the block keyword in the selected rule of the gene model.
Briefing:
void genamic(exonGFF* E, long nExons, packGenes* pg, gparam* gp)
genamic processing: Prepare reverse exons to be used. Sorting by donor the input set of exons. For every exon, look up the dictionary to get the identifier of type. For every class in which this exon is in the downstream part, get the temporary best gene according to class and frame, right now: Checking maximum/minimum distances Updating best temporary gene between previous and current exon Assembling best temporary gene and exon, checking groups Updating best gene pointer Restore reverse exons.

Back to top

geneidc

Description:
Main program: management and control of geneid data flow.
Briefing:
void main (int argc, char *argv[])
geneid main program actions: Initialize accounting structure Read setup options from command line Set ratios (for sites/exons memory allocation) from file size estimation Request memory for main geneid data structures Read parameters file (statistical model to predict genomic elements) [optional]. Read annotations and homology SRs from input file Read first DNA sequence from input file (more than one allowed) Processing sequence: upper case, reverse and complement Divide sequence into overlapped fragments (length = LENGTHSi) For every fragment do: Measure G+C content to select the proper isochore Forward strand predictions: sites, exons (both scored and filtered) Reverse strand predictions: sites, exons (both scored and filtered) Merge forward and reverse predicted exons (and annotations) Integrate external information: evidence and HSPs Sort by left (minor) position the whole set of predicted exons Print genomic elements: sites and exons (selected format) Run genamic algorithm to assemble the best gene possible Save necessary information to go on computing the next fragment Delete the rest of information (to save memory) Print best prediction: multiple gene output (selected format) There is a partial prediction mode: reading exons directly from input file and running assembling algorithm (genamic) to display the best gene predicted

Back to top

GetSitesWithProfile

Description:
Search by signal: prediction of start codons, acceptor splice sites and donor splice sites by using a Position Weighted Array (a position weight matrix where every position is a Markov chain instead of a simple nucleotide distribution function). Predicted signals score must be higher than a fixed cutoff score depending on the type. The recorded position for predicted signals is: A from (ATG) for starts, X from XGT for donors and X from AGX for acceptors.
Briefing:
long GetSitesWithProfile(char* s, profile* p, site* st, long l1, long l2)
Scan the input sequence applying the PWA to every fragment candidate to contain a true signal (length = profile.dimension). Applying the PWA: for every position i, look for the probability of finding the (i-k..i) oligonucleotide in this position, being the candidate a real signal, over the probability being a false signal. In every position, the Markov chain is different, and the core is the set of consecutive positions where the bias is complete (k fixed nucleotides with probability 0 or 1, i.e. the characteristic dinucleotide for donors is GT in the core). If the order of Markov chain is 0 or 1, to look up the Markov string is done directly, while a loop is required for order higher than 1 (trinucleotides and so on). Candidate regions obtaining a higher than cutoff score are inserted into the result list (array). Returned the number of final predicted signals.

Back to top

GetStopCodons

Description:
Search by signal: prediction of stop codons by using a Position Weighted Array (a position weight matrix where every position is a Markov chain instead of a simple nucleotide distribution function). Predicted stops score must be higher than a fixed cutoff score. For every stop the recorded position is the last coding nucleotide before the core TGA\|TAG\|TAA.
Briefing:
long GetStopCodons(char* s, profile* p, site* sc, long l1, long l2)
Scan the input sequence applying the PWA to every fragment candidate to contain a true signal (length = profile.dimension). Applying the PWA: for every position i, look for the probability of finding the (i-k..i) oligonucleotide in this position, being the candidate a real signal, over the probability being a false signal. In every position, the Markov chain is different, and the core is the set of consecutive positions where the bias is complete (k fixed nucleotides with probability 0 or 1). If the order of Markov chain is 0 or 1, to look up the Markov string is done directly, while a loop is required for order higher than 1 (trinucleotides and so on). Candidate regions obtaining a higher than cutoff score are inserted into the result list (array). Before finishing, stop codons at the end of sequence are computed as well on regions smaller than dimension of the profile. Returned the number of final predicted stops.

Back to top

manager

Description:
Control of signal and exon prediction and scoring functions, both in positive and negative strand processing, in a range (l1,l2). Main parameters are the input sequence (forward or reverse), length of sequence, coordinates of the ends of the fragment, the pack of sites and exons to be filled in, scoring model (isochores) and GC information precomputed before about G+C content on that split.
Briefing:
void manager(char Sequence, long LengthSequence, packSites allSites, packExons* allExons, long l1, long l2, int Strand, packExternalInformation* external, packHSP* hsp, gparam* gp, gparam** isochores, int nIsochores, packGC* GCInfo)
Stablishing limits in the fragment for signal prediction (and therefore the exon construction, as well) because of the overlapping processing of fragments in which exons must be predicted following an order and repeated exons are not allowed. Search by signal: start codons, acceptor and donor splice sites and stop codons. Construction of exons: every feasible pair of left and right signals (specific depending on the exon type) will be used to build one exon as long as the parameter Max_donor_per_acceptor (maximum right signals used together with the same left signal to build different exons) in the current isochore is not reached. Scoring exons according to the G+C content in the region around every exon. Due to the pre-processing step done before (packGC), there is no need to scan the sequence for each exon. NOTE: Reverse prediction implies recomputing signal positions according to the forward sense of the original sequence (remember they have been predicted reading the reverse sequence and therefore, having reverse coordinates). Moreover, left and right signals must be exchanged to preserve the property: left signal position < right signal position.

Back to top

output

Description:
Output management: according to the options selected by the user, display the results, information about processing (if verbose), and errors.
Briefing:
void printMess(char* s)
Display information about geneid running (if verbose).
void printRes(char* s)
Display information about geneid results and statistics (if verbose).
void printError(char* s)
Display errors happened during geneid running (aborted).
void printReadingInfo(char* s)
Display number of nucleotides read from input DNA sequence.
void PrintProfile (profile* p, char* signal)
Print Position Weight Array parameters: type, offset, dimension, Markov order and cutoff.
void OutputHeader(char* locus, long l)
Display information about DNA input sequence loaded (length, locus). Output headers depending on the format selected: GFF, XML or geneid.
void Output(packSites* allSites, packSites* allSites_r, packExons* allExons, packExons* allExons_r, exonGFF* exons, long nExons, char* Locus, long l1, long l2, char* Sequence, gparam* gp, dict* dAA)
Display results in the proper format according to user preferences: sites and exons (separated strands) and/or the set of predicted (and sorted) exons.
void OutputGene(packGenes* pg, long nExons, char* Locus, char* Sequence, gparam* gp, dict* dAA)
Output the best predicted genes by calling to CookingGenes routine.
void OutputStats(char* Locus)
Display the overall amount of (every type) predicted sites and exons.
void OutputTime()
Display the total time used to process the input sequence (CPU and user time).

Back to top

PrintExons

Description:
Output predicted exons, selecting the proper format according to the setup options and whether the exon is part of a gene (gff/xml/geneid) or not (gff/geneid).
Briefing:
void PrintExon(exonGFF e, char Name[], char s, dict* dAA)
Print a exon, previously translated into amino acids, using the proper output format taking into account the positions plus offsets (corrections), if exon is not an annotation.
void PrintExons (exonGFF e, long ne, int type, char Name[], long l1, long l2, char Sequence, dict* dAA)
Print a list of exons by using the proper output format.
void PrintGExon(exonGFF e, char Name[], char s, dict* dAA, long ngen, int AA1, int AA2, int nAA)
Print an exon which is part of a gene (identifer, amino acid positions) in gff/geneid format. For annotations, type field is different.
void PrintXMLExon(exonGFF *e, char Name[], long ngen, long nExon, int type1, int type2, int nExons)
Print an exon which is part of a gene by using the XML format.

Back to top

PrintSites

Description:
Print lists of predicted signals (motif, score, position) using the selected output format. There are some parts of the list which will be not output to avoid printing twice the same signal (overlapping of fragments implies repeated computing). Before printing, signal limits are modified to display the positions of the core (AG, GT, ATG or TGA\|TAG\|TAA) as pos1 and pos2. Motifs are always printed using the correct reading sense (+ or -) as positive.
Briefing:
void PrintSite(site* s, int type, char Name[], int Strand, char* seq, profile* p)
Processing information about a signal to print its motif but anchored at the core (ATG, AG, GT or TGA\|TAA\|TAG) by computing the values k and offset (see profiles), and taking into account the COFFSET: everything according to the signal type and format (gff/geneid). Output motif, score and position of the input site.
void PrintSites (site* s, long ns, int type, char Name[], int Strand, long l1, long l2, char* seq, profile *p)
Print a list of sites, but only the signals between "i" and "printPoint" which are values depending on the type of the signals. This is to avoid print the same signal twice in two different fragments due to the overlapping.

Back to top

readargv

Description:
Read the command line options selected by the user, checking incompatibilities among options or problems with the number of input filenames. Every chosen option is represented by raising the corresponding flag (global var).
Briefing:
void printHelp()
Output the list of available options.
void printDTD()
Output the Document Type Definition to validate the XML output (only for best predicted genes).
void readargv (int argc,char* argv[], char* ParamFile, char* SequenceFile, char* ExonsFile, char* HSPFile)
Read the options selected by the user watching incompatibilities (among different options and about number of input filenames), setting the proper global vars (geneid.c) and acquiring the name of different external but optional filenames: parameter, exons (evidences) and similarity to protein regions.

Back to top

ReadExonsGFF

Description:
Read an external file of exons. These annotations will be added in the final gene prediction either forcing to include them or mixing with the ab initio predictions (depending on the existence of value in the group field (gff) of annotations to preserve or not the complete annotated gene). `GFF` format line (tab as field separator): Name: locusname of the input sequence. Source: program used to produce the output. Type: biological feature described here. Begin: left position in the sequence. End right position in the sequence. Score: probability to be a real prediction. Strand: forward or reverse reading sense. Frame: number of nucleotides of the left non complete codon. [group]: The way to group a set of exons forming a gene. `geneid` wildcards: represented with a dot '.' Score: maximum score will be assigned to the annotation. Frame: three exons will be created from this one (3 reading frames). Group: If is present, the set of annotations (exons) with the same group will be preserved as a complete gene as long as they match the gene assembling rules. If not, annotations will be mixed with ab initio predictions as long as they might contribute to be part of the best genes. -- Annotations MUST be ordered by left position (beginning) --
Briefing:
long ReadExonsGFF (char FileName, packEvidence pv, dict* d)
Input exons from an external file. They must be sorted by the left position (increasing order). It returns the number of created annotations. Exons with unknown (not in gene model) features are skipped, displaying one warning for every one.
packEvidence* SelectEvidence(packExternalInformation* external, char* Locus)
Select the group of annotations to be integrated into the predictions according to current Locus name (using a hash table - dictionary of locus).

Back to top

ReadGeneModel

Description:

Acquiring the information to establish the allowed connections between exon types, from the parameter file. Gene model might therefore be modified by the user, changing/adding features, distances or switch group restriction on.

Gene model rules have this form:

U₁ : U₂ : ... : U_n	D₁ : D₂ : ... : D_m	dmin : dmax [block]

where

the list (Upstream Compatible) of types U₁ : U₂ : ... : U_n are features allowed to connect in front of features in the list (Downstream Equivalent) D₁ : D₂ : ... : D_m, as long as they match the distances [dmin,dmax] requirement.
block flag in a rule means the restriction of having the same group is switch on for 2 candidates to connect by using this assembly class.
Maximum distance may be skipped by using the keyword "Infinity".

Briefing:

void shareGeneModel(gparam** isochores, int nIsochores)

To share the gene model information between several isochores by using the same pointers to the dictionary of features and other arrays and copying extra information.

long ReadGeneModel (FILE *file, dict *d,
                    int nc[], int ne[],
                    int UC[][MAXENTRY],
                    int DE[][MAXENTRY],
                    long md[], long Md[],
                    int block[])

Parsing the gene model rules to extract for every feature this information: lines (rules) where it was at the left part (upstream), at the right (downstream). For every rule (class or line), to save the range of distances allowed to connect and whether the restriction of having the same group is raised or not. This information is stored at the dictionary data structure and some extra structures such as UC, DE, nc, ne and distances arrays. If the option -F is activated, a couple of artificial rules with feature sGHOST are introduced to force one complete gene prediction.

Back to top

ReadParam

Description:
Reading the statistical model to predict both signals and exons (scoring functions) and the gene model rules which allowed connections between exons according to their type and the distance between. Data is input from a separate file (parameter file) to an array of isochore data structure.
Briefing:
void readLine(FILE File, char line)
Read one line of numerical values (maximum MAXLINE characters). Skip comment lines (begin by '#') and empty lines.
void readHeader(FILE File, char line)
Read one line (name of some parameter) (maximum MAXLINE characters). Skip comment lines (begin by '#') and empty lines. Headers are textual labels, so user may modify them without being verified the changes.
void ReadProfile(FILE RootFile, profile p, char* signal)
Acquire statistics about signal prediction: definition of the position weight array (length, offset, cutoff and Markov chain order profile) and transition probabilities.
void ReadIsochore(FILE RootFile, gparam gp)
Read and save the information about signal/exon prediction in an specific isochore (DNA region with a well-defined and biased G+C content): C+G min/max percentages cutoff score for every type of exon coding potential score for every type of exon weight of coding potential score against signal score in the total score for every type of exon (from 0 to 1) exon weight parameter (correct the length of predicted genes) for every type of exon profiles (PWAs) to discover splicing and translation signals information about Markov chains measuring the coding potential property in predicted exons (biased distribution of some oligonucleotides in protein coding DNA) (Initial and transition matrices). Max number of donors per acceptor (building exons).
int readparam (char name, gparam* isochores)
Main routine for the management of reading parameter file: identifying the source (command line option P, enviroment var GENEID, default) number of isochores to read loading isochores information read the gene model rules for the assembly of predicted exons

Back to top

ReadSequence

Description:
Reading DNA sequences, placing them in a previously allocated string which will post-processed (reverse and complement) then. More than one sequence is allowed to be in the input file but always respecting the FASTA format.
Briefing:
long analizeFile(char* SequenceFile)
Read the size of the input file to estimate the size of its DNA sequence.
int IniReadSequence(FILE* seqfile, char* line)
Reading the locusname of first DNA sequence in the input file. (Remember this file is allowed to contain more than one fasta sequence)
int ReadSequence (FILE* seqfile, char* Sequence, char* nextLocus)
Reading the current sequence and get the locus name from the next sequence until there are not more sequences in the input file. (Multiple string locus names and dynamic length of fasta lines are allowed)

Back to top

ReadHSP

Description:
Read the external file containing HSPs from sequence alignments. Blast High-scoring Segment Pairs over the input sequence are projected over the input sequence, taking the best value in every position. Positions without homology support are assigned the value `NO_SCORE` (parameter file). By using these homologous regions to the input DNA sequence, exons overlapping any HSP will be enhancered and better predictions will be obtained. HSPs file is in gff format (8 mandatory records plus string group optional). If blastn (DNA vs DNA) is used to obtain the homologous regions, their frame is unknown, so a dot (".") can be used to make three copies from every one. -- It is NOT necessary to provide the HSPs records ordered by any position --
Briefing:
packHSP* SelectHSP(packExternalInformation* external, char* Locus, long LengthSequence)
Select the group of HSPs to be integrated into the predictions according to current Locus name (using a hash table - dictionary of locus). Both FWD and RVS HSPs will be sorted using a quicksort routine.
long ReadHSP (char* FileName, packExternalInformation* external)
To place every HSP into the correct array according to their blast frame (1,2,3) and strand. If frame is unknown (blastn), then 3 copies of the same HSP will be produced. (It IS necessary a previous sorting in the HSP file)

Back to top

RecomputePositions

Description:
Position of predicted signals by reading the reverse (negative) sense on the original input sequence must be translated (normalised) into coordinates in the forward sense (normal) to be mixed with the forward predictions. Reverse exons defined with those signals have the strand property "-" (reverse) to distinguish from forward exons.
Briefing:
void RecomputePositions(packSites* allSites, long l)
Translation from positions (P_-) in reverse sequence into positions (P₊) in forward sequence by computing: P₊ = L - P_- - 1 where L is the length of the input sequence.

Back to top

RequestMemory

Description:
This set of functions are the responsible of memory management, allocating main geneid data structures, from predictions to parameters and statistical models, as well as temporary structures reset in every fragment of the input sequence.
Briefing:
account* RequestMemoryAccounting()
Memory for account data structure.
char* RequestMemorySequence(long L)
Memory for a DNA sequence.
packSites* RequestMemorySites()
Memory for main structure pack of sites (forward or reverse sense) and every type of signal: acceptor and donor splice sites, start and stop in translation.
packExons* RequestMemoryExons()
Memory for pack of exons and every type of exons: NUMEXONS is modified (divided) by RSINGL, RFIRST, RINTER, RTERMI and RORF ratios because some type of exons (internal and terminal) appear more frequently than others.
exonGFF* RequestMemorySortExons()
Memory to sort exons found in the current fragment.
packEvidence* RequestMemoryEvidence()
Memory for input annotations (evidences).
packHSP* RequestMemoryHomology()
Memory for input similarity to protein regions (homology).
packExternalInformation* RequestMemoryExternalInformation()
Memory for external information (dictionary of Locus names)
packGC* RequestMemoryGC()
Memory for the precomputed array of G/C's and N's nucleotides on a DNA subsequence.
gparam* RequestMemoryParams()
Memory for the statistical model of one isochore loading from the parameter file: profiles, Markov initial/transition arrays, exon scoring parameters. Memory for temporary arrays which compute accumulated sum of penta/hexanucleotides in every split.
gparam ** RequestMemoryIsochoresParams()
Memory for the array of isochores and their common information: gene model dictionary and allowed distances arrays.
void RequestMemoryProfile(profile* p)
Memory for every position containing a Markov chain inside in a Position Weight Array (profile).
packGenes* RequestMemoryGenes()
Memory for the main structure, Ghost exon, best partial genes (Ga) and sorting by donor functions (d-array) in every gene class, necessary in genamic.
packDump* RequestMemoryDumpster()
Memory for backup information between two fragments: sites, exons and a hash table useful to save only once every exon and avoid usual redundancy between best partial genes (they share most exons).
dict* RequestMemoryAaDictionary()
Memory for the genetic code, a hash table to translate from predicted genes or exons into proteins.

Back to top

ScoreExons

Description:
This module is implemented to score (to give a measure of reliability) and filter predicted exons. There are 3 different scoring sources which make up the final value for a given exon: score from signals (sites), score from protein coding potential probability and score from provided homology information. Statistical parameters for every type of exon are extracted from parameters file. For protein coding potential, a Markov model of order 5 is employed, supporting different isochores usage for predictions on different G+C content sequences. G+C frequencies and Markov transition scores are computed by using the accumulated sum technique, with a linear cost instead of the usual quadratic value.
Briefing:
int SelectIsochore(float percent, gparam** isochores)
Given a float value (between 0 and 1) representing the G+C value in a DNA region, returns the identifier of the isochore whose coding potential Markov model is adapted to work under this range. Isochores are not supposed to be sorted and range is not verified anywhere so it is strongly recommended to be careful when parameter file is modified.
float ComputeGC(packGC* GCInfo, long inigc, long endgc)
G+C content (percentage): computing step. For every region, subsequence of the original sequence or fragment, the percent of G+C is quickly computed by using the accumulated sum technique: instead of scanning the sequence whenever is necessary to count the G+C percentage in a subsequence of the original input (i.e. exons), it is much more efficient to write down the frequency of a given nucleotide until every position, and then, the absolute frequency for that nucleotide between 2 positions is the rest of both accumulated values (Linear time versus quadratic cost). Unknown nucleotides (N) are not taken into account because sequences sometimes might contain an important amount of them.
void CGScan(char* s, packGC* GCInfo, long l1, long l2)
G+C content (percentage): pre-processing step. Scan the whole sequence, counting how many C/Gs or Ns are in. Then, resting the accumulated values stored in any two positions (i.e. start and end of a subsequence) divided by the rest of these values (i.e. length of the subsequence) is the G+C content of the corresponding subsequence of the original input, between those two positions.
long OligoToInt(char* s, int ls)
Translation from a string into a numerical value according to the function f such that f(A) = 0, f(C) = 1, f(G) = 2, f(T) = 3, and f(N) = 4. It is used to index arrays using olinucleotides by translating them into integers.
void MarkovScan(char* sequence, gparam* gp, float* OligoDistIni[3], float* OligoDistTran[3], long l1, long l2)
Score exons: pre-processing step. Exons are scored by using a Markov model: initial matrix and transition matrices. Score of a given exon is: score assigned for the first pentanucleotide (initial value) plus score computed for the hexanucleotides content derived from codon bias (transition values). To compute the transtion value, the accumulated values of scores for every possible (3) subsequence into the original input are computed. Then, to score an exon, the rest between the accumulated values for its 2 ends or delimiting positions must be computed. In this way, scoring one exon is executed with a constant cost instead of scanning the whole exon (linear), for every exon (linear time versus quadratic).
void HSPScan(packExternalInformation* external, packHSP* hsp, int Strand, long l1, long l2)
[Optional]. If homology information about the input sequence is provided, projection of HSPs overlapping current fragment of DNA is performed into an array of l2 - l1 +1 positions.
void HSPScan2(packExternalInformation* external, packHSP* hsp, int Strand, long l1, long l2)
[Optional]. If homology information about the input sequence is provided, the array containing HSP projection is preprocessed to save precomputed sums of every subset of positions in the current sequence fragment.
float ScoreHSPexon(exonGFF* exon, int Strand, packExternalInformation* external, long l1, long l2)
[Optional]. If homology information about the input sequence is provided, exons supported (total or partial intersection) by homology regions increase their score proportionally. HSPs are similarity to protein regions (projections over the sequence of blast High-scoring Segment Pairs in which, the best score for every position is recorded). .
long Score(exonGFF Exons, long nExons, long l1, long l2, int Strand, packExternalInformation external, packHSP* hsp, gparam** isochores, packGC* GCInfo)
Score exons: computing step. For every exon from the input, computing the coding potential score (Markov chain) by using a specific isochore according to its G+C content. Homology to protein score is computed from a provided list of HSPs. The final exon score result of a weighted combination (site, exon and homology factors) between coding potential score and score from both signals, plus homology score. There are cutoff points for the coding potential and final scores. It returns the number of exons overcoming the filter.
void ScoreExons(char Sequence, packExons allExons, long l1, long l2, int Strand, packExternalInformation* external, packHSP* hsp, gparam** isochores, int nIsochores, packGC* GCInfo)
Main exon scoring routine. control the data flow.

Back to top

SearchEvidenceExons

Description:
Management of how many annotations are going to be included in the current split together with ab initio predictions, depending on their positions (range l1,l2).
Briefing:
void SearchEvidenceExons(packExternalInformation* p, packEvidence* evidence, long l2)
Extracting the block (list) of annotations useful to integrate in the current fragment with the ab initio predictions. This block is defined by the counters i1vExons and i2vExons.
void SwitchCounters(packExternalInformation* p)
Prepare to process the next block of annotations from i2vExons up to forward in the following fragment of sequence.
void resetEvidenceCounters(packExternalInformation* p)
First block of annotations to be processed.

Back to top

Setratios

Description:
Estimate the amount of predicted signals and exons in a subsequence of length min(L,LENGTHSi), from the defined values (RSITES, REXONS) in the include file. Whether L is longer than LENGTHSi or not, space to backup some inter-fragments information would be required as well because the sequence will be divided and processed into (L / LENGTHSi) splits.
Briefing:
void SetRatios(long* NUMSITES, long* NUMEXONS, long* MAXBACKUPSITES, long* MAXBACKUPEXONS, long L)
From the defined values (geneid.h) RSITES, REXONS, RBSITES, RBEXONS, and length of the input sequence, an estimation for the amount of signals and exons of (every type) is computed, in order to ask for enough memory to allocate them, producing the values NUMSITES, NUMEXONS, MAXBACKUPSITES, MAXBACKUPEXONS. (Note: NUMEXONS will be modified before asking the memory, by using the ratios RSINGL, RFIRST, RINTER, RTERMI and RORF according to the type of exon because some types are more likely to find than others)

Back to top

SortExons

Description:
Sorting predicted exons of every type (first, internal, terminal, single and ORFs) and place all of them in the array used to assemble the best genes (maximum amount: NUMEXONS * FSORT).
Briefing:
void FreeItems(struct exonitem *q)
Free memory allocated for a list of nodes (exon, next node) in a recursive way.
void InsertFirstGhostExon(exonGFF* Exons)
Insertion of an artificial initial feature to force prediction of a complete gene in the current sequence.
void InsertLastGhostExon(exonGFF* Exons, long n, long L)
Insertion of an artificial terminal feature to force prediction of a complete gene in the current sequence.
void UpdateList(struct exonitem** p, exonGFF* InputExon)
Insert the input exon into a list according to the its position: create a new node and insert it at the end of the list. Note: all of the exons of the list begin at the same nucleotide.
void SortExons(packExons* allExons, packExons* allExons_r, packExternalInformation* external, packEvidence* pv, exonGFF* Exons, long l1, long l2)
Sorting all of predicted exons by using this method: insert every exon in a table of lists ExonList (range 1:Length of the fragment) where every position will contain a list of exons which begin in the same position. Then, screen the table, picking up the exons from every list and place them into the final array preserving the same order of traversing the table. If the option -F has been activated, artificial initial and terminal features are integrated in the array in the exons coming from first and last sequence fragments.

Back to top

SortHSPs

Description:
Module to implement a quicksort routine in charge of sorting the HSPs
Briefing:
int Split(HSP hsps,int i, int j)<br> int RSplit(HSP hsps,int i, int j)
Fix a pivot element and then splits the input array HSP(i..j) into the elements with a value lower (left half) and higher (right half). The R (reverse) routine is designed to obtain the reverse sorting.
void quickSort(HSP hsps, int i, int j)<br> void RquickSort(HSP hsps, int i, int j)
Main recursive routine of quicksort. The original array is recursively divided into fragments of elements which will be sorted following a divide and conquer algorithm. The R (reverse) routine is designed to obtain the reverse sorting.
void SortHSPs(packHSP* p)
Main routine of the module to call quicksort for sorting HSPs in each frame and strand.

Back to top

SwitchFrames

Description:
To exchange frame and remainder in reverse exons is necessary because while during exon construction, frame and remainder are dependent on the sense of reading (frame for the first uncomplete codon, following the reading sense, and remainder for the end), in genamic assembling frame is always associated to the left (geographical) uncomplete codon and remainder to the right one. Thus, gene assembling is reduced to assemble the pieces according only to some restrictions such as allowed rules (gene model) or frame-remainder constraints, forgetting different exon properties.
Briefing:
void SwitchFrames(exonGFF* e, long n)
Exchanging/restore frame and remainder in the exons predicted during current sequence fragment (only reverse exons).
void SwitchFramesDa(packGenes* pg, int nclass)
Exchanging frame and remainder in the exons saved during previous fragment processing into the d-array (only reverse exons). Using the flag selected, because some exons might be in more than one list and this change must be only once. This exchange will be restored after finishing genamic processing.
void SwitchFramesDb(packGenes* pg, int nclass)
Restore frame and remainder only in the exons saved during previous fragment processing into the d-array (only reverse exons). Using the flag selected, because some exons might be in more than one list and this change must be only once. (Note: Exons predicted in the current fragment will not have raised the flag selected. Thus, although they are right now in the d-array, their frame and remainder will not change during this function)
void UndoFrames(exonGFF* e, long n)
Restore frame and remainder in the input set of exons when they have been loaded from a gff file (partial prediction, only assembling).

Back to top

SwitchPositions

Description:
To exchange left and right signals to have exons matching: left signal < right signal. This property is required to work with the exons predicted in both strands, but forgetting their strand and focusing in their properties as pieces to assemble (exon type, strand, score) making up genes.
Briefing:
void SwitchPositions(packExons* allExons)
Exchange left (acceptor) signal and right (donor) signal in every type of exons predicted by reading the reverse sense sequence.

Back to top

Translate

Description:
This module implements the translation of genomic sequences (exons and genes) into proteins by using the Genetic Code (amino acid dictionary). The function to extract the exonic nucleotides is provided as well.
Briefing:
int Translate(long p1, long p2, short fra, short rmd, char* s, dict* dAA, char sAux[])
Translation of exons: obtaining the protein product by applying the Genetic code to the input exon sequence. First and last uncomplete codons are preserved (in lower case) corresponding to frame and remainder, respectively. Complete codons are translated by using the dictionary of amino acids. It returns the number of amino acids (including uncomplete codons) and the protein.
void TranslateGene(exonGFF* e, char* s, dict* dAA, long nExons, int** tAA, char* prot, long* nAA)
Translate a whole gene: translation of the list of linked exons that form the gene, taking care with the connections (frame/remainder) which are shared between 2 consecutive exons. Two different processes accoriding to the gene strand.
void GetcDNA(exonGFF* e, char* s, long nExons, char* cDNA, long* nNN)
Traversing the exons of the gene, picking up the nucleotides and joining them building the exonic sequence.

Back to top

geneidh

Description

Definitions of constants, data types and headers of functions used outside the module which contains the implementation.

Briefing

LENGTHSi	Input sequence is divided into fragments having this length.
OVERLAP	Nucleotides shared between 2 consecutive fragments.
RSITES	Given a DNA sequence, one signal per RSITES nucleotides is estimated to be predicted on (computing NUMSITES).
REXONS	Given a DNA sequence, one exon per REXONS nucleotides is estimated to be predicted on (computing NUMEXONS).
RBSITES	Given a DNA sequence, one signal per RBSITES nucleotides is the estimated ratio of sites which must be copied between 2 fragments (computing BACKUPSITES).
RBEXONS	Given a DNA sequence, one exon per RBEXONS nucleotides is the estimated ratio of exons which must be copied between 2 fragments (computing BACKUPEXONS).
RFIRST	NUMEXONS / RFIRST initial exons are supposed to be predicted in LENGTHSi bases.
RINTER	NUMEXONS / RINTER internal exons are supposed to be predicted in LENGTHSi bases.
RTERMI	NUMEXONS / RTERMI terminal exons are supposed to be predicted in LENGTHSi bases.
RSINGL	NUMEXONS / RSINGL single genes are supposed to be predicted in LENGTHSi bases.
RORF	NUMEXONS / RSINGL ORFs are supposed to be predicted in LENGTHSi bases.
FSORT	FSORT * NUMEXONS is the maximum number of predicted exons (both strands) in LENGTHSi bases.
NUMEVIDENCES	Maximum number of annotations per locus read from optional file.
MAXHSP	Maximum number of HSPs (protein homology) read from optional file (per locus, strand and frame).
MAXNSEQUENCES	Maximum number of locus in multi-fasta files.
MAXGENE	Maximum number if predicted genes.
MAXEXONGENE	Maximum amount of exons in every gene.
MAXAA	Maximum length of predicted proteins (in amino acids).
MAXCDNA	Maximum length of exonic sequence for a gene (cDNA).
MAXISOCHORES	Maximum number of isochores in the parameters file.
EXONLENGTH SINGLEGENELENGTH ORFLENGTH	Minimum allowed size for internal exons, single genes and ORFs.
MINEXONLENGTH MINSCORELENGTH	Minimum size for exons to compute protein coding potential (score if not).
ISOCONTEXT	Size of region around an exon when its G+C content is computed.
NULL_OLIGO_SCORE	Penalty for N's in scoring exons.
LOCUSLENGTH	Maximum number of chars for the name of the input sequence.
OLIGOLENGTH	Maximum length for the oligonucleotides used in Markov chains (scoring sites and exons).
VERSION	Current geneid version.
SITES, EXONS, EVIDENCE	Field feature in gff standard.
FRAMES, STRANDS FORWARD, REVERSE ACC, DON, STA, STO FIRST, INTERNAL, TERMINAL SINGLE ORF, LENGTHCODON PERCENT, MEGABYTE MAXTIMES, PROT, DNA, MINUTE	Numerical constants.
sFORWARD, sREVERSE, sACC sDON, sSTA, sSTO, sFIRST sINTERNAL, sTERMINAL, sSINGLE sORF, sEXON, xmlFORWARD xmlREVERSE	String constants.
BLOCK, NONBLOCK	Mark gene model rules as blocking or not blocking. To preserve or not the group structure in gene annotations (evidences).
COFFSET	Correction because of arrays in C starts from 0 to N -1.
MAXLINE	Maximum number of characters read from input line.
MAXSTRING	Maximum allowed length of geneid messages and strings.
MAXENTRY MAXTYPE MAXINFO NOTFOUND	Dictionary (hash table) definitions.
HASHFACTOR	Computing size of hash table to save best genes (sites, exons).
PARAMETERFILE	Default parameters file: filename.
FILENAMELENGTH	Maximum size for filenames.
INFI, sINFI, INF	Representing the infinity value.
MAXSCORE	Default value for annotations forced to appear in the final prediction.
FASTALINE	Lenght (maximum) for fasta lines
MESSAGE_FREQ	Display message with amount of sequence read (frequency).
NOGROUP	String to represent (ab initio) exons without group (field 9 in gff).

Back to top

Previous Chapter · . Next Chapter

Source Code Documentation

account

BackupGenes

beggar

BuildAcceptor

BuildInitialExons

BuildInternalExons

BuildORFs

BuildSingles

BuildSort

BuildTerminalExons

CookingGenes

ComputeStopInfo

CorrectExon

Dictionary

DumpHash

FetchSequence

genamic

geneidc

GetSitesWithProfile

GetStopCodons

manager

output

PrintExons

PrintSites

readargv

ReadExonsGFF

ReadGeneModel

ReadParam

ReadSequence

ReadHSP

RecomputePositions

RequestMemory

ScoreExons

SearchEvidenceExons

Setratios

SortExons

SortHSPs

SwitchFrames

SwitchPositions

Translate

geneidh

Clone this wiki locally