A genetic algorithm used to identify regions of regularity in the genome of Drosophila melanogaster. This algorithm has been tested on previously published ChIP data for the transcription factor, Myocyte-enhancer factor 2 (MEF2).
Genetic algorithms (GA) are algorithms in which the basic principles behind Darwinian evolution are applied algorithmically to a computational probelm of interest. This approach can generate results by simply applying the idea of "survival of the fittest".
- Chromosomes store genetic information
- Individuals in a population that are deemed “most genetically fit” prevail
- Those that do not, either die off or undergo mutation/crossover
- Those that are the most elite continue their lineage in the next generation
- This process is repeated for several generations, until the population evolves and equilibrium is reached
These five concepts can be applied programmatically to many scenarios in order to solve complex problems. Some examples of where genetic algorithms have been very successful are (but not limited to):
- Software bug detection and repair (Forrest et al 2009, Le Goues et al 2013)
- Medicine and Biology (Ghaheri et al 2015, Hackenberger, 2019, )
- Bioinformatics & data science (Manning et al 2013, Yang and Honavar)
- Film, movies, and gaming (Trescak et al 2012, Sanjuan et al 2007)
ConSeGA is a genetic algorithm approach for detecting consensus sequences in biological data sets.
- The algorithm defines the set of set of "individuals" or "chromosomes" as: a set of strings of 15 characters in length comprised of only A, T, G, and C. In the case of the algorithm this was derived from MEF2-ChIP data ().
- The method to determine how "fit" an individual is, or fitness function, is determed by the window of string with a sum of the 10 lowest consecutive entopy values. Here is the entropy sliding window calculation:
It is then followed by the overall sum of entropies sliding window calculation:
- The most fit window of strings is automatically excluded from mutation.
-
Any string that is not in the most fit windows is subject to random mutation. A mutation in this case, is a random position shift in the alignment array of +/-3 base pairs.
-
Elitism is use to ensure the most fit window of string is automatically incorporated into the next generation of string alignments.
-
This process is repeated or "evolved" for 3000 generations.
ConSeGA was written in Python version 2.7. The following Python dependecies are required:
- NumPy
- Matplotlib
The main python function to run ConSeGA is tf_genetic_algorithm_multFunc_version2.py
. It requires an input text file of "chromsomes" which are kmer sequences. There will be a command line option of arguments, in order to update generations, array size, kmer length, and input file location, however, for the moment this file is hardcoded to mimic what we have validated. These parameters can be updated within the code at line 23 for your input file and the size of kmers, arrays, and generations can also be updated in the code and then run.
python2.7 tf_genetic_algorithm_multFunc_version2.py