-
Notifications
You must be signed in to change notification settings - Fork 6
Source Code Documentation
Emilio Righi edited this page Jan 19, 2023
·
1 revision
Class Name | Description | Code |
---|---|---|
account.c | Description | Code |
BackUpGenes.c | Description | Code |
beggar.c | Description | Code |
BuildAcceptor.c | Description | Code |
BuildInitialExons.c | Description | Code |
BuildInternalExons.c | Description | Code |
BuildORFs.c | Description | Code |
BuildSingles.c | Description | Code |
BuildSort.c | Description | Code |
BuildTerminalExons.c | Description | Code |
CookingGenes.c | Description | Code |
ComputeStopInfo.c | Description | Code |
CorrectExon.c | Description | Code |
Dictionary.c | Description | Code |
DumpHash.c | Description | Code |
FetchSequence.c | Description | Code |
genamic.c | Description | Code |
geneidc.c | Description | Code |
GetSitesWithProfile.c | Description | Code |
GetStopCodons.c | Description | Code |
manager.c | Description | Code |
output.c | Description | Code |
PrintExons.c | Description | Code |
PrintSites.c | Description | Code |
readargv.c | Description | Code |
ReadExonsGFF.c | Description | Code |
ReadGeneModel.c | Description | Code |
ReadParam.c | Description | Code |
ReadSequence.c | Description | Code |
ReadHSP.c | Description | Code |
RecomputePositions.c | Description | Code |
RequestMemory.c | Description | Code |
ScoreExons.c | Description | Code |
SearchEvidenceExons.c | Description | Code |
Setratios.c | Description | Code |
SortExons.c | Description | Code |
SortHSPs.c | Description | Code |
SwitchFrames.c | Description | Code |
SwitchPositions.c | Description | Code |
Translate.c | Description | Code |
geneidh.c | Description | Code |
Description: |
To save accounting information such as statistics about amount of predicted
sites and exons or (CPU and real) time consumed for the current running. This
data will be displayed at the end of the processing (verbose mode).
|
Briefing: |
account* InitAcc() |
Allocation of accounting data structure and setting the counters (including
the time counter).
|
void updateTotals(account *m,
packSites* allSites,
packSites* allSites_r,
packExons* allExons,
packExons* allExons_r) |
Update stats for every type of sites and exons. Compute the total number
of sites and exons (both strands and together) in the input sequence.
|
void cleanAcc(account* m) |
Reset total number of sites and exons counters for the next input sequence.
|
Description: |
Data structures about predicted signals and exons are emptied and fill in
in every fragment processing, but some exons (and genes) are needed to be
used in the next split and therefore they must be temporally in the arrays
of sites and exons into the dumpster. To avoid copying twice or more times
the same exon, a hash table (DumpHash) is available to record which exons have been
already copied and their address. Gene structure (linked exons) must be
preserved between copies of genes.
|
Briefing: |
long IncrMod(long x, long Modulus) |
Increase values modulus an input long number.
|
exonGFF* backupExon(exonGFF* E,
exonGFF* Prev,
packDump* d) |
It saves an exon (properties and sites) into the dumpster. Returns the
pointer to the new copy for this exon.
|
exonGFF* backupGene(exonGFF* E, packDump* d) |
It saves the gene (exons, sites, properties) which last exon is E, into the
dumpster. For every exon of this gene, check whether it has already been
copied or not, by looking up the hash table. Note: Whenever
any exon of a gene is found to be in the dumpster, it is not necessary to go
on copying the rest of the gene, because, from this point, both genes will
be exactly equal (dynamic programming). This is because there is only one
best gene finished in this exon, whatever exons have forward. Obviously, this
function is recursive due to the need to conserve exon links (to backup one
exon, its previous exon must have already been copied, and therefore the
address is known).
|
void BackupGenes(packGenes* pg, int nclass, packDump* d) |
To save best partial assembled genes and temporary optimal gene from
packGenes structure to be used the next fragment of the input sequence.
|
void BackupArrayD(packGenes* pg, long accSearch,
gparam* gp, packDump* dumpster) |
From the set of exons input to genamic (gene assembling algorithm), some
ordering functions by donor (right position) are computed for every class
or assembling rule in the gene model. Most exons are used in the same
iteration where they were produced, but a few of them in the intersection
between two fragments of sequence, must be saved to be able to recover the
gene assembling process in the next iteration, checking minimum/maximum
distance requirements as well.
|
void cleanGenes(packGenes* pg,
int nclass,
packDump* dumpster) |
Preparing sort-by-donor functions and best partial genes structures for
the next DNA sequence (reset counters and pointers).
|
Description: |
Statistics about memory requirements. This option displays
the information and stops geneid running.
|
Briefing: |
void beggar(long L) |
Estimate the memory needed to run geneid (current configuration)
|
Description: |
Statistics about memory requirements. This option displays
the information and stops geneid running.
|
Briefing: |
float ComputeExtraProfile(char* s,
long positionAcc,
long limitRight,
profile* p) |
Given a predicted acceptor site, search the best BPoint
or PPTract juest before the AG signal.
|
Briefing: |
long BuildAcceptors(char* s,
profile* p,
profile* ppt,
profile* bp,
site* st,
long l1,
long l2) |
Prediction of acceptor sites, taking into account the
associated signals (BPoint, PPTract) to the characteristic
AG core.
|
Description: |
Initial (first) exon construction: first exons are DNA regions beginning
in a start codon (ATG) and finishing in a donor site (GT), not including
any stop codon in frame with the translation start.
|
Briefing: |
long BuildInitialExons(site *Start, long nStarts,
site *Donor, long nDonors,
site *Stop, long nStops,
int MaxDonors,
char* Sequence,
exonGFF *Exon) |
Every list of signals is sorted by position due to the signal prediction
process. For every Start codon, donors between the Start and the first Stop in
frame are candidates to be First exons. However, there is a limited maximum
amount of exons to build, (parameter MaxDonors), being allocated during the
processing in a local array. If there is no room for more exons beginning
by this Start, the worst donor will be rejected. The output list of exons
is sorted by Start position. According to the nucleotides at the end of the
exon, some information is computed to detect possible Stop codons when
assembling with other exons.
|
Description: |
Internal exon construction: internal exons are DNA regions beginning
in a acceptor splice site (AG) and finishing in a donor site (GT),
not including any stop codon in frame. Internal exons may have the first
and the last codon uncomplete (frame and remainder). Thus, to get the true
frame, the length of the uncomplete codon must be added to the left position of the exon so that
stops in frame with this new frame can be searched.
Two exons (E,F) are allowed to be assembled (E-F) only if remainder(E)
+ frame(F) = 3, to build a complete codon.
|
Briefing: |
long BuildInternalExons(site *Acceptor, long nAcceptors,
site *Donor, long nDonors,
site *Stop, long nStops,
int MaxDonors,
char* Sequence,
exonGFF* Exon) |
Every list of signals is sorted by position due to the signal prediction
process. For every couple (Acceptor, Donor), three exons might be built
according to the number of frames closed by the Stops following the current
Acceptor. There is a limited maximum amount of exons to build, (parameter
MaxDonors per frame), being allocated during the processing in a local array. Moreover, a limited
minimum length required is provided (EXONLENGTH). If there is no room for
more exons beginning by this Acceptor, the worst donor will be rejected. The
output list of exons is sorted by Acceptor position. According to the nucleotides at the end of the exon, some information is computed to detect possible Stop codons when assembling with other exons.
|
Description: |
Open Reading Frames (ORFs) construction: ORFs are DNA regions
beginning in one start codon (ATG), being as long as it is possible, finishing
with a stop codon, and not including any other previous stop codon
in frame.
|
Briefing: |
long BuildORFs(site *Start, long nStarts,
site *Stop, long nStops,
long cutPoint,
exonGFF *Exon) |
Every list of signals is sorted by position due to the signal prediction
process. For every start codon, looking for the first stop codon in frame
to build the ORF. The CutPoint value is necessary to keep the order
between the set of exons predicted on one fragment and the set predicted on
the next one. ORFLENGTH value checks the minimal length of a single
gene. The output list of exons is sorted by Start codon position.
|
Description: |
Single genes construction: single gene exons are DNA regions beginning
in one start codon (ATG) and finishing in a stop codon in frame
(TGA/TAG/TAA), not including any other previous stop codon in frame.
|
Briefing: |
long BuildSingles(site *Start, long nStarts,
site *Stop, long nStops,
long cutPoint,
exonGFF *Exon) |
Every list of signals is sorted by position due to the signal prediction
process. For every start codon, looking for the first stop codon in frame
to build the single gene. The CutPoint value is necessary to keep the order
between the set of exons predicted on one fragment and the set predicted on
the next one. SINGLEGENELENGTH value checks the minimal length of a single
gene. The output list of exons is sorted by Start codon position.
|
Description: |
Given one exon, the main goal is to find the best gene finished in that
exon (dynamic programming), looking for every exon which may be placed before
it, according to the assembling rules expressed
in the gene model, and selecting the best one (higher gene
score). Given a rule or gene class, the set of exons that may be used
at the left part of the rule are stored, sorted by donor (right) position,
in the d-array structure (plus extra array of counters). This module sorts
by donor the input exons according to every class rules and the type of exons.
As the set of input exons were initially ordered by acceptor, donor
re-arrangements are very similar and preserves quite well the original order.
|
Briefing: |
void BuildSort(dict* D,
int nc[],
int ne[],
int UC[][MAXENTRY],
int DE[][MAXENTRY],
int nclass,
long km[],
exonGFF* **d,
exonGFF* E,
long nexons) |
Essentially, for every exon and the set of available rules or classes (in which
its type is at the left part): to insert into the sorting function by
running the sorting algorithm by insertion.
(Note: As input exons are sorted by acceptor
and this order is partially maintained in the arrangement by
donor, there will be many simple insertion operations and few operations
of shifting in the d-arrays (linear time cost)).
|
Description: |
Terminal exon construction: terminal exons are DNA regions beginning
in one acceptor site (AG) and finishing in a stop codon in frame
(TGA/TAG/TAA), not including any other previous stop codon in frame.
|
Briefing: |
long BuildTerminalExons (site *Acceptor, long nAcceptors,
site *Stop, long nStops,
long LengthSequence,
exonGFF* Exon,
long cutPoint) |
Every list of signals is sorted by position due to the signal prediction
process. For every acceptor codon there are three possible exons to build
depending on the corresponding stop in frame exists. The CutPoint value
is necessary to keep the order between the set of exons predicted on one
fragment and the set predicted on the next one. The output list of exons is
sorted by Acceptor site position.
|
Description: |
Terminal exon construction: terminal exons are DNA regions beginning
in one acceptor site (AG) and finishing in a stop codon in frame
(TGA/TAG/TAA), not including any other previous stop codon in frame.
|
Briefing: |
long BuildTerminalExons (site *Acceptor, long nAcceptors,
site *Stop, long nStops,
long LengthSequence,
exonGFF* Exon,
long cutPoint) |
Every list of signals is sorted by position due to the signal prediction
process. For every acceptor codon there are three possible exons to build
depending on the corresponding stop in frame exists. The CutPoint value
is necessary to keep the order between the set of exons predicted on one
fragment and the set predicted on the next one. The output list of exons is
sorted by Acceptor site position.
|
Description: |
Genes are series of connected exons. It sometimes happens that one stop codon is made up by
the final nucleotides of the left exon and the last nucleotides of the right one. To avoid
this situation, geneid records these features for every exon. Then, during the assembly, this
requirement is evaluated.
|
Briefing: |
void ComputeStopInfo(exonGFF* e, char* s) |
Record which nucleotides were at the end of the current exon according to the content
of the input sequence. Depending on the lValue (beginning) and rValue (end),
during the assembly, a connection will be allowed or not.
|
Description: |
Fixing the coordinates for predicted exons. Arrays and most processing in
language C start in 0 to L-1, while positions of predictions in the real
sequence ranges from 1 to L. Moreover, stop codons are included into the
Terminal exons, Single Genes and ORFs.
|
Briefing: |
void CorrectExon(exonGFF* e) |
Increase/decrease positions (offset1, offset2) due to the C offset in the
arrays. Include stop codon in Terminal exons and Single genes.
|
void CorrectORF(exonGFF* e) |
Increase/decrease positions (offset1, offset2) due to the C offset in the
arrays. Stop codon included.
|
Description: |
This module implements a look-up table or dictionary to simulate associative
arrays by using a hash table that binds unique integer keys to gene features
such as "First" or "Internal" (combined with the strand "+" or "-"). There
is enough room for MAXENTRY different features.
|
Briefing: |
void resetDict(dict* d) |
Reset the counter nextFree (assign new key) and NULL pointers in the hash
table of sinonimous.
|
int f(char s[]) |
Hash function: weighted sum of chars of the input string modulus the length
of the hash table to get an integer between 0..MAXENTRY-1.
|
int setkeyDict(dict* d, char s[]) |
Assign a new key for the input string if it was not into the hash table.
Management of computing hash function and colissions.
|
int getkeyDict(dict* d, char s[]) |
Finding a word into the dictionary by computing the hash function
and looking up the proper list of nodes in the hash table. NOTFOUND
is returned if the word is not found, but the key is returned if it exists.
|
void showDict(dict* d) |
Display the nodes (string,key) stored into the dictionary.
|
void freeNodes(pnode node) |
Set free a list of sinonimous (colissions).
|
void freeDict(dict* d) |
Set recursively free the dictionary hash table of sinonimous.
|
void setAADict(dict* d, char s[], char aA) |
Store the Genetic Code (1 codon : 1 amino acid) in a dictionary.
|
char getAADict(dict* d, char s[]) |
Returns the translation of the input codon by using the Genetic Code.
|
Description: |
Between one fragment of DNA input sequence and the next one to be
processed, some useful information must be saved to go the prediction on:
best partial genes and exons still not used in the gene assembling. But
between best partial genes, there is a lot of redundancy (the same or similar
set of exons). Every exon must be saved only once, not more, so that by
using the dumspter (hash table of backup exons), is very easy to know
whether one exon has already been copied or not. Exon features as signal
positions, exon type or strand are used to bind unique keys to these exons.
|
Briefing: |
void resetDumpHash(dumpHash* h) |
Initialize the hash table and set the counter of exons copied.
|
long fDump(exonGFF* E) |
Hash function: integer computed from exon features.
|
void setExonDumpHash(exonGFF* E, dumpHash* h) |
Insert the input exon into the dumpster hash table after have been copied.
|
exonGFF* getExonDumpHash(exonGFF* E, dumpHash* h) |
Finding an exon into the hash table to know it must be whether copied or not.
If it has already been copied before, the address of the copy is returned.
|
void freeDumpNodes(dumpNode* node) |
Free recursively a list of sinonimous nodes.
|
void cleanDumpHash(dumpHash *h) |
Free all of lists of sinonimous nodes in the hash table.
|
Description: |
Prepare the input sequence (changing lower to upper cases, if needed),
producing the reverse sense sequence which will be processed in parallel with
the original sequence.
|
Briefing: |
int complement(int c) |
Return the complementary nucleotide to c.
|
long FetchSequence(char *s, char* r) |
Running from both ends in the input sequence, reversing and complementing
the nucleotides to produce the reverse sense sequence. Return the length of
both sequences. Output sequence has been previously allocated.
|
void ReverseSubSequence(long p1, long p2,
char* s, char* r) |
Reverse and complement a fragment of DNA from p1 to p2 in the input sequence.
Output sequence has been previously allocated.
|
Description: |
Assembling the best gene from an input set of exons by using the algorithm
genamic which is based on dynammic programming techniques. The best
gene is the series of linked exons which have the highest sum of scores respecting
the allowed rules (gene model). Optimal result is guaranteed. Exons must be
sorted by acceptor (left, minor) position and within genamic, for every
assembling rule (class), exons are also sorted by donor (right, major)
position. The goal is to assemble the best gene finished with every exon and
then, return the highest score produced gene. Essentially, for every exon
in the input, there are 3 possible assemblings: remain alone, join to the
gene assembled to the previous input exon or join to the best gene that
finishes between the previous input exon and it. Having acceptor and
donor sorting functions, every exon (and the associated gene finishing
with it) is used only once, and therefore, genamic is a linear time algorithm
(respect to the number of input exons). Annotations
or evidences may be included among the predicted exons. Group field is
then used to allow the mix between ab initio and evidence exons or
to force that only exons having same group can be joined using a blocked
rule. To block a rule, use the block keyword in the selected rule of the
gene model.
|
Briefing: |
void genamic(exonGFF* E,
long nExons,
packGenes* pg,
gparam* gp) |
genamic processing:
|
Description: |
Main program: management and control of geneid data flow.
|
Briefing: |
void main (int argc, char *argv[]) |
geneid main program actions:
There is a partial prediction mode: reading exons directly from input file and running assembling algorithm (genamic) to display the best gene predicted |
Description: |
Search by signal: prediction of start codons, acceptor splice sites and donor
splice sites by using a Position Weighted Array (a position weight matrix where
every position is a Markov chain instead of a simple nucleotide distribution
function). Predicted signals score must be higher than a fixed cutoff score
depending on the type. The recorded position for predicted signals is:
A from (ATG) for starts, X from XGT for donors and X from AGX for acceptors.
|
Briefing: |
long GetSitesWithProfile(char* s,
profile* p,
site* st,
long l1,
long l2) |
Scan the input sequence applying the PWA to every fragment candidate to
contain a true signal (length = profile.dimension). Applying the PWA: for
every position i, look for the probability of finding the (i-k..i) oligonucleotide in
this position, being the candidate a real signal, over the probability being
a false signal. In every position, the Markov chain is different, and the core is the set of
consecutive positions where the bias is complete (k fixed nucleotides
with probability 0 or 1, i.e. the characteristic dinucleotide for donors is
GT in the core). If the order of Markov chain is 0 or 1, to look up the Markov string
is done directly, while a loop is required for order higher than 1
(trinucleotides and so on). Candidate regions obtaining a higher than
cutoff score are inserted into the result list (array). Returned the number
of final predicted signals.
|
Description: |
Search by signal: prediction of stop codons by using a Position Weighted
Array (a position weight matrix where every position is a Markov chain
instead of a simple nucleotide distribution function). Predicted stops
score must be higher than a fixed cutoff score. For every stop
the recorded position is the last coding nucleotide before the core
TGA|TAG|TAA.
|
Briefing: |
long GetStopCodons(char* s,
profile* p,
site* sc,
long l1,
long l2) |
Scan the input sequence applying the PWA to every fragment candidate to
contain a true signal (length = profile.dimension). Applying the PWA: for
every position i, look for the probability of finding the (i-k..i) oligonucleotide
in this position, being the candidate a real signal, over the probability being
a false signal. In every position, the Markov chain is different, and the core is the set of
consecutive positions where the bias is complete (k fixed nucleotides
with probability 0 or 1). If the order of Markov chain is 0 or 1, to look up the Markov string
is done directly, while a loop is required for order higher than 1
(trinucleotides and so on). Candidate regions obtaining a higher than
cutoff score are inserted into the result list (array).
Before finishing, stop codons at the end of sequence are computed as well
on regions smaller than dimension of the profile.
Returned the number of final predicted stops.
|
Description: |
Control of signal and exon prediction and scoring functions, both in
positive and negative strand processing, in a range (l1,l2). Main parameters
are the input sequence (forward or reverse), length of sequence, coordinates
of the ends of the fragment, the pack of sites and exons to be filled in,
scoring model (isochores) and GC information precomputed before about G+C
content on that split.
|
Briefing: |
void manager(char *Sequence,
long LengthSequence,
packSites* allSites,
packExons* allExons,
long l1, long l2,
int Strand,
packExternalInformation* external,
packHSP* hsp,
gparam* gp,
gparam** isochores,
int nIsochores,
packGC* GCInfo) |
NOTE: Reverse prediction implies recomputing signal positions according to the forward sense of the original sequence (remember they have been predicted reading the reverse sequence and therefore, having reverse coordinates). Moreover, left and right signals must be exchanged to preserve the property: left signal position < right signal position. |
Description: |
Output management: according to the options selected by the user, display the
results, information about processing (if verbose), and errors.
|
Briefing: |
void printMess(char* s) |
Display information about geneid running (if verbose).
|
void printRes(char* s) |
Display information about geneid results and statistics (if verbose).
|
void printError(char* s) |
Display errors happened during geneid running (aborted).
|
void printReadingInfo(char* s) |
Display number of nucleotides read from input DNA sequence.
|
void PrintProfile (profile* p, char* signal) |
Print Position Weight Array parameters: type, offset, dimension,
Markov order and cutoff.
|
void OutputHeader(char* locus, long l) |
Display information about DNA input sequence loaded (length, locus). Output
headers depending on the format selected: GFF, XML or geneid.
|
void Output(packSites* allSites,
packSites* allSites_r,
packExons* allExons,
packExons* allExons_r,
exonGFF* exons,
long nExons,
char* Locus,
long l1,
long l2,
char* Sequence,
gparam* gp,
dict* dAA) |
Display results in the proper format according to user preferences: sites
and exons (separated strands) and/or the set of predicted (and sorted) exons.
|
void OutputGene(packGenes* pg,
long nExons,
char* Locus,
char* Sequence,
gparam* gp,
dict* dAA) |
Output the best predicted genes by calling to CookingGenes routine.
|
void OutputStats(char* Locus) |
Display the overall amount of (every type) predicted sites and exons.
|
void OutputTime() |
Display the total time used to process the input sequence (CPU and user time).
|
Description: |
Output predicted exons, selecting the proper format according to the
setup options and whether the exon is part of a gene (gff/xml/geneid)
or not (gff/geneid).
|
Briefing: |
void PrintExon(exonGFF *e, char Name[], char* s, dict* dAA) |
Print a exon, previously translated into amino acids, using the proper
output format taking into account the positions plus offsets (corrections),
if exon is not an annotation.
|
void PrintExons (exonGFF *e,
long ne,
int type,
char Name[],
long l1, long l2,
char* Sequence,
dict* dAA) |
Print a list of exons by using the proper output format.
|
void PrintGExon(exonGFF *e,
char Name[],
char* s,
dict* dAA,
long ngen,
int AA1,
int AA2,
int nAA) |
Print an exon which is part of a gene (identifer, amino acid positions) in
gff/geneid format. For annotations, type field is different.
|
void PrintXMLExon(exonGFF *e,
char Name[],
long ngen,
long nExon,
int type1,
int type2,
int nExons) |
Print an exon which is part of a gene by using the XML format.
|
Description: |
Print lists of predicted signals (motif, score, position) using the selected
output format. There are some parts of the list which will be not output to
avoid printing twice the same signal (overlapping of fragments implies
repeated computing). Before printing, signal limits are modified to display
the positions of the core (AG, GT, ATG or TGA|TAG|TAA) as pos1 and pos2. Motifs
are always printed using the correct reading sense (+ or -) as positive.
|
Briefing: |
void PrintSite(site* s, int type,
char Name[], int Strand,
char* seq, profile* p) |
Processing information about a signal to print its motif but anchored at the
core (ATG, AG, GT or TGA|TAA|TAG) by computing the values k and offset (see
profiles), and taking into account the COFFSET: everything according to the
signal type and format (gff/geneid). Output motif, score and position of the
input site.
|
void PrintSites (site* s,
long ns,
int type,
char Name[],
int Strand,
long l1, long l2,
char* seq,
profile *p) |
Print a list of sites, but only the signals between "i" and "printPoint" which
are values depending on the type of the signals. This is to avoid print the
same signal twice in two different fragments due to the overlapping.
|
Description: |
Read the command line options selected by the user, checking incompatibilities
among options or problems with the number of input filenames. Every chosen
option is represented by raising the corresponding flag (global var).
|
Briefing: |
void printHelp() |
Output the list of available options.
|
void printDTD() |
Output the Document Type Definition to validate the XML output (only for best
predicted genes).
|
void readargv (int argc,char* argv[],
char* ParamFile, char* SequenceFile,
char* ExonsFile, char* HSPFile) |
Read the options selected by the user watching incompatibilities (among
different options and about number of input filenames), setting the proper
global vars (geneid.c) and acquiring the name of different external
but optional filenames: parameter, exons (evidences) and similarity to
protein regions.
|
Description: |
Read an external file of exons. These annotations will be added in the
final gene prediction either forcing to include them or mixing with the
ab initio predictions (depending on the existence of value in the group
field (gff) of annotations to preserve or not the complete annotated gene).
GFF format line (tab as field separator):
|
Briefing: |
long ReadExonsGFF (char *FileName,
packEvidence* pv,
dict* d) |
Input exons from an external file. They must be sorted by the left position
(increasing order). It returns the number of created annotations. Exons
with unknown (not in gene model) features are skipped, displaying one warning
for every one.
|
packEvidence* SelectEvidence(packExternalInformation* external,
char* Locus) |
Select the group of annotations to be integrated into the predictions according
to current Locus name (using a hash table - dictionary of locus).
|
Description: | |||
Acquiring the information to establish the allowed connections between
exon types, from the parameter file. Gene model might therefore be modified
by the user, changing/adding features, distances or switch group restriction on.
Gene model rules have this form:
where
|
|||
Briefing: | |||
void shareGeneModel(gparam** isochores, int nIsochores) |
|||
To share the gene model information between several isochores by using
the same pointers to the dictionary of features and other arrays and
copying extra information.
|
|||
long ReadGeneModel (FILE *file, dict *d,
int nc[], int ne[],
int UC[][MAXENTRY],
int DE[][MAXENTRY],
long md[], long Md[],
int block[]) |
|||
Parsing the gene model rules to extract for every feature this information:
lines (rules) where it was at the left part (upstream), at the right
(downstream). For every rule (class or line), to save the range of distances
allowed to connect and whether the restriction of having the same group
is raised or not. This information is stored at the dictionary data structure
and some extra structures such as UC, DE, nc, ne and distances arrays. If the option -F is activated, a couple of artificial rules with feature sGHOST
are introduced to force one complete gene prediction.
|
Description: |
Reading the statistical model to predict both signals and exons (scoring
functions) and the gene model rules which allowed connections between
exons according to their type and the distance between. Data is input from
a separate file (parameter file) to an array of isochore data structure.
|
Briefing: |
void readLine(FILE *File, char* line) |
Read one line of numerical values (maximum MAXLINE characters). Skip
comment lines (begin by '#') and empty lines.
|
void readHeader(FILE *File, char* line) |
Read one line (name of some parameter) (maximum MAXLINE characters). Skip
comment lines (begin by '#') and empty lines. Headers are textual labels,
so user may modify them without being verified the changes.
|
void ReadProfile(FILE *RootFile, profile* p, char* signal) |
Acquire statistics about signal prediction: definition of the position
weight array (length, offset, cutoff and Markov chain order profile)
and transition probabilities.
|
void ReadIsochore(FILE *RootFile, gparam* gp) |
Read and save the information about signal/exon prediction in an specific isochore
(DNA region with a well-defined and biased G+C content):
|
int readparam (char *name, gparam** isochores) |
Main routine for the management of reading parameter file:
|
Description: |
Reading DNA sequences, placing them in a previously allocated string which
will post-processed (reverse and complement) then. More than one sequence
is allowed to be in the input file but always respecting the FASTA format.
|
Briefing: |
long analizeFile(char* SequenceFile) |
Read the size of the input file to estimate the size of its DNA sequence.
|
int IniReadSequence(FILE* seqfile, char* line) |
Reading the locusname of first DNA sequence in the input file.
(Remember this file is allowed to contain more than one fasta sequence)
|
int ReadSequence (FILE* seqfile,
char* Sequence,
char* nextLocus) |
Reading the current sequence and get the locus name from the next sequence
until there are not more sequences in the input file. (Multiple
string locus names and dynamic length of fasta lines are allowed)
|
Description: |
Read the external file containing HSPs from sequence alignments.
Blast High-scoring Segment Pairs over the input sequence are projected over
the input sequence, taking the best value in every position. Positions without
homology support are assigned the value NO_SCORE (parameter file).
By using these homologous regions to the input DNA sequence, exons overlapping
any HSP will be enhancered and better predictions will be obtained.
|
Briefing: |
packHSP* SelectHSP(packExternalInformation* external,
char* Locus,
long LengthSequence) |
Select the group of HSPs to be integrated into the predictions according
to current Locus name (using a hash table - dictionary of locus). Both FWD and
RVS HSPs will be sorted using a quicksort routine.
|
long ReadHSP (char* FileName,
packExternalInformation* external) |
To place every HSP into the correct array according to their blast frame (1,2,3) and strand. If frame is unknown (blastn), then 3 copies of the same HSP will be produced. (It IS necessary a previous sorting in the HSP file)
|
Description: |
Position of predicted signals by reading the reverse (negative) sense on
the original input sequence must be translated (normalised) into coordinates in the
forward sense (normal) to be mixed with the forward predictions. Reverse
exons defined with those signals have the strand property "-" (reverse) to
distinguish from forward exons.
|
Briefing: |
void RecomputePositions(packSites* allSites, long l) |
Translation from positions (P-) in reverse sequence into
positions (P+) in forward sequence by computing:
P+ = L - P- - 1 where L is the length of the input sequence. |
Description: |
This set of functions are the responsible of memory management, allocating
main geneid data structures, from predictions to parameters and statistical
models, as well as temporary structures reset in every fragment of the
input sequence.
|
Briefing: |
account* RequestMemoryAccounting() |
Memory for account data structure.
|
char* RequestMemorySequence(long L) |
Memory for a DNA sequence.
|
packSites* RequestMemorySites() |
Memory for main structure pack of sites (forward or reverse sense)
and every type of signal: acceptor and donor splice sites, start and
stop in translation.
|
packExons* RequestMemoryExons() |
Memory for pack of exons and every type of exons:
NUMEXONS is modified (divided) by RSINGL, RFIRST, RINTER, RTERMI and RORF
ratios because some type of exons (internal and terminal) appear more
frequently than others.
|
exonGFF* RequestMemorySortExons() |
Memory to sort exons found in the current fragment.
|
packEvidence* RequestMemoryEvidence() |
Memory for input annotations (evidences).
|
packHSP* RequestMemoryHomology() |
Memory for input similarity to protein regions (homology).
|
packExternalInformation* RequestMemoryExternalInformation() |
Memory for external information (dictionary of Locus names)
|
packGC* RequestMemoryGC() |
Memory for the precomputed array of G/C's and N's nucleotides on a
DNA subsequence.
|
gparam* RequestMemoryParams() |
Memory for the statistical model of one isochore loading from
the parameter file: profiles, Markov initial/transition arrays, exon scoring
parameters. Memory for temporary arrays which compute accumulated sum of
penta/hexanucleotides in every split.
|
gparam ** RequestMemoryIsochoresParams() |
Memory for the array of isochores and their common information:
gene model dictionary and allowed distances arrays.
|
void RequestMemoryProfile(profile* p) |
Memory for every position containing a Markov chain inside in a Position
Weight Array (profile).
|
packGenes* RequestMemoryGenes() |
Memory for the main structure, Ghost exon, best partial genes (Ga) and
sorting by donor functions (d-array) in every gene class, necessary in
genamic.
|
packDump* RequestMemoryDumpster() |
Memory for backup information between two fragments: sites, exons
and a hash table useful to save only once every exon and avoid
usual redundancy between best partial genes (they share most exons).
|
dict* RequestMemoryAaDictionary() |
Memory for the genetic code, a hash table to translate from predicted genes
or exons into proteins.
|
Description: |
This module is implemented to score (to give a measure of reliability) and
filter predicted exons. There are 3 different scoring sources which make
up the final value for a given exon: score from signals (sites), score from
protein coding potential probability and score from provided homology
information. Statistical parameters for every type of exon are extracted
from parameters file. For protein coding potential, a Markov model of order
5 is employed, supporting different isochores usage for predictions on
different G+C content sequences. G+C frequencies and Markov transition
scores are computed by using the accumulated sum technique, with a linear
cost instead of the usual quadratic value.
|
Briefing: |
int SelectIsochore(float percent, gparam** isochores) |
Given a float value (between 0 and 1) representing the G+C value in a DNA
region, returns the identifier of the isochore whose coding potential
Markov model is adapted to work under this range. Isochores are not supposed
to be sorted and range is not verified anywhere so it is strongly recommended
to be careful when parameter file is modified.
|
float ComputeGC(packGC* GCInfo, long inigc, long endgc) |
G+C content (percentage): computing step. For every region, subsequence of
the original sequence or fragment, the percent of G+C is quickly computed
by using the accumulated sum technique: instead of scanning the sequence
whenever is necessary to count the G+C percentage in a subsequence of the
original input (i.e. exons), it is much more efficient to write down the frequency
of a given nucleotide until every position, and then, the absolute frequency
for that nucleotide between 2 positions is the rest of both accumulated
values (Linear time versus quadratic cost). Unknown nucleotides (N) are not
taken into account because sequences sometimes might contain an important
amount of them.
|
void CGScan(char* s, packGC* GCInfo, long l1, long l2) |
G+C content (percentage): pre-processing step. Scan the whole sequence, counting
how many C/Gs or Ns are in. Then, resting the accumulated values stored
in any two positions (i.e. start and end of a subsequence) divided by the
rest of these values (i.e. length of the subsequence) is the G+C content
of the corresponding subsequence of the original input, between those
two positions.
|
long OligoToInt(char* s, int ls) |
Translation from a string into a numerical value according to the
function f such that f(A) = 0, f(C) = 1, f(G) = 2, f(T) = 3, and f(N) = 4.
It is used to index arrays using olinucleotides by translating them into
integers.
|
void MarkovScan(char* sequence,
gparam* gp,
float* OligoDistIni[3],
float* OligoDistTran[3],
long l1, long l2) |
Score exons: pre-processing step. Exons are scored by using a Markov model:
initial matrix and transition matrices. Score of a given exon is: score
assigned for the first pentanucleotide (initial value) plus score computed
for the hexanucleotides content derived from codon bias (transition values).
To compute the transtion value, the accumulated values of scores for every
possible (3) subsequence into the original input are computed. Then, to score
an exon, the rest between the accumulated values for its 2 ends or delimiting
positions must be computed. In this way, scoring one exon is executed with
a constant cost instead of scanning the whole exon (linear), for every exon
(linear time versus quadratic).
|
void HSPScan(packExternalInformation* external,
packHSP* hsp,
int Strand,
long l1, long l2) |
[Optional]. If homology information about the input sequence is provided,
projection of HSPs overlapping current fragment of DNA is performed into
an array of l2 - l1 +1 positions.
|
void HSPScan2(packExternalInformation* external,
packHSP* hsp,
int Strand,
long l1, long l2) |
[Optional]. If homology information about the input sequence is provided,
the array containing HSP projection is preprocessed to save precomputed
sums of every subset of positions in the current sequence fragment.
|
float ScoreHSPexon(exonGFF* exon,
int Strand,
packExternalInformation* external,
long l1, long l2) |
[Optional]. If homology information about the input sequence is provided,
exons supported (total or partial intersection) by homology regions increase their
score proportionally. HSPs are similarity to protein regions (projections
over the sequence of blast High-scoring Segment Pairs in which, the
best score for every position is recorded). .
|
long Score(exonGFF *Exons,
long nExons,
long l1,
long l2,
int Strand,
packExternalInformation* external,
packHSP* hsp,
gparam** isochores,
packGC* GCInfo) |
Score exons: computing step. For every exon from the input, computing the
coding potential score (Markov chain) by using a specific isochore
according to its G+C content. Homology to protein score is computed from
a provided list of HSPs. The final exon score result of a weighted
combination (site, exon and homology factors) between coding potential
score and score from both signals, plus homology score. There are
cutoff points for the coding potential and final scores. It returns
the number of exons overcoming the filter.
|
void ScoreExons(char *Sequence,
packExons* allExons,
long l1,
long l2,
int Strand,
packExternalInformation* external,
packHSP* hsp,
gparam** isochores,
int nIsochores,
packGC* GCInfo) |
Main exon scoring routine. control the data flow.
|
Description: |
Management of how many annotations are going to be included in the
current split together with ab initio predictions, depending on their
positions (range l1,l2).
|
Briefing: |
void SearchEvidenceExons(packExternalInformation* p,
packEvidence* evidence,
long l2) |
Extracting the block (list) of annotations useful to
integrate in the current fragment with the ab initio predictions.
This block is defined by the counters i1vExons and i2vExons.
|
void SwitchCounters(packExternalInformation* p) |
Prepare to process the next block of annotations from i2vExons up to forward
in the following fragment of sequence.
|
void resetEvidenceCounters(packExternalInformation* p) |
First block of annotations to be processed.
|
Description: |
Estimate the amount of predicted signals and exons in a subsequence of length
min(L,LENGTHSi), from the defined values (RSITES, REXONS) in the include file.
Whether L is longer than LENGTHSi or not, space to backup some inter-fragments
information would be required as well because the sequence will be divided
and processed into (L / LENGTHSi) splits.
|
Briefing: |
void SetRatios(long* NUMSITES,
long* NUMEXONS,
long* MAXBACKUPSITES,
long* MAXBACKUPEXONS,
long L) |
From the defined values (geneid.h) RSITES, REXONS, RBSITES, RBEXONS,
and length of the input sequence, an estimation for the amount of
signals and exons of (every type) is computed, in order to ask for
enough memory to allocate them, producing the values NUMSITES, NUMEXONS,
MAXBACKUPSITES, MAXBACKUPEXONS.
(Note: NUMEXONS will be modified
before asking the memory, by using the ratios RSINGL, RFIRST, RINTER, RTERMI and
RORF according to the type of exon because some types are more likely to find
than others)
|
Description: |
Sorting predicted exons of every type (first, internal, terminal, single and
ORFs) and place all of them in the array used to assemble the best genes
(maximum amount: NUMEXONS * FSORT).
|
Briefing: |
void FreeItems(struct exonitem *q) |
Free memory allocated for a list of nodes (exon, next node) in a recursive
way.
|
void InsertFirstGhostExon(exonGFF* Exons) |
Insertion of an artificial initial feature to force prediction
of a complete gene in the current sequence.
|
void InsertLastGhostExon(exonGFF* Exons, long n, long L) |
Insertion of an artificial terminal feature to force prediction
of a complete gene in the current sequence.
|
void UpdateList(struct exonitem** p, exonGFF* InputExon) |
Insert the input exon into a list according to the its position: create a
new node and insert it at the end of the list. Note: all of the exons of the
list begin at the same nucleotide.
|
void SortExons(packExons* allExons,
packExons* allExons_r,
packExternalInformation* external,
packEvidence* pv,
exonGFF* Exons,
long l1, long l2) |
Sorting all of predicted exons by using this method: insert every exon in
a table of lists ExonList (range 1:Length of the fragment) where every position
will contain a list of exons which begin in the same position. Then, screen
the table, picking up the exons from every list and place them into the final
array preserving the same order of traversing the table. If the option -F has
been activated, artificial initial and terminal features are integrated in the
array in the exons coming from first and last sequence fragments.
|
Description: |
Module to implement a quicksort routine in charge of sorting the HSPs
|
Briefing: |
int Split(HSP** hsps,int i, int j)<br>
int RSplit(HSP** hsps,int i, int j) |
Fix a pivot element and then splits the input array HSP(i..j)
into the elements with a value lower (left half) and higher (right half).
The R (reverse) routine is designed to obtain the reverse sorting.
|
void quickSort(HSP** hsps, int i, int j)<br>
void RquickSort(HSP** hsps, int i, int j) |
Main recursive routine of quicksort. The original array is recursively
divided into fragments of elements which will be sorted following a
divide and conquer algorithm. The R (reverse) routine is designed to
obtain the reverse sorting.
|
void SortHSPs(packHSP* p) |
Main routine of the module to call quicksort for sorting HSPs in each
frame and strand.
|
Description: |
To exchange frame and remainder in reverse exons is necessary because
while during exon construction, frame and remainder are dependent on the
sense of reading (frame for the first uncomplete codon, following the
reading sense, and remainder for the end), in genamic assembling frame is
always associated to the left (geographical) uncomplete codon and
remainder to the right one. Thus, gene assembling is reduced to assemble
the pieces according only to some restrictions such as allowed rules
(gene model) or frame-remainder constraints, forgetting different exon
properties.
|
Briefing: |
void SwitchFrames(exonGFF* e, long n) |
Exchanging/restore frame and remainder in the exons predicted during current
sequence fragment (only reverse exons).
|
void SwitchFramesDa(packGenes* pg, int nclass) |
Exchanging frame and remainder in the exons saved during previous fragment
processing into the d-array (only reverse exons). Using the flag selected,
because some exons might be in more than one list and this change must be
only once. This exchange will be restored after finishing genamic processing.
|
void SwitchFramesDb(packGenes* pg, int nclass) |
Restore frame and remainder only in the exons saved during previous fragment
processing into the d-array (only reverse exons). Using the flag selected,
because some exons might be in more than one list and this change must be
only once. (Note: Exons predicted in the current fragment will not have
raised the flag selected. Thus, although they are right now in the d-array,
their frame and remainder will not change during this function)
|
void UndoFrames(exonGFF* e, long n) |
Restore frame and remainder in the input set of exons when they have been
loaded from a gff file (partial prediction, only assembling).
|
Description: |
To exchange left and right signals to have exons matching: left signal <
right signal. This property is required to work with the exons predicted
in both strands, but forgetting their strand and focusing in their properties
as pieces to assemble (exon type, strand, score) making up genes.
|
Briefing: |
void SwitchPositions(packExons* allExons) |
Exchange left (acceptor) signal and right (donor) signal in every type
of exons predicted by reading the reverse sense sequence.
|
Description: |
This module implements the translation of genomic sequences
(exons and genes) into proteins by using the Genetic Code (amino acid
dictionary). The function to extract the exonic nucleotides is provided
as well.
|
Briefing: |
int Translate(long p1,
long p2,
short fra,
short rmd,
char* s,
dict* dAA,
char sAux[]) |
Translation of exons: obtaining the protein product by applying the Genetic
code to the input exon sequence. First and last uncomplete codons are preserved
(in lower case) corresponding to frame and remainder, respectively. Complete
codons are translated by using the dictionary of amino acids. It returns the
number of amino acids (including uncomplete codons) and the protein.
|
void TranslateGene(exonGFF* e,
char* s,
dict* dAA,
long nExons,
int** tAA,
char* prot,
long* nAA) |
Translate a whole gene: translation of the list of linked exons that form
the gene, taking care with the connections (frame/remainder) which are
shared between 2 consecutive exons. Two different processes accoriding
to the gene strand.
|
void GetcDNA(exonGFF* e,
char* s,
long nExons,
char* cDNA,
long* nNN) |
Traversing the exons of the gene, picking up the nucleotides and joining
them building the exonic sequence.
|
Description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Definitions of constants, data types and headers of functions used outside
the module which contains the implementation.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Briefing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Previous Chapter · . Next Chapter
Enrique Blanco Garcia © 2003