Home
With advances in Next Generation Sequencing technologies, the number of identified gene sequences is growing exponentially, yet the function of most of these newly identified sequences remains unknown. Different bioinformatics approaches have been proposed to complement experimental techniques in annotating protein function. Among them, Phylogenetic Profiling (PP) is a method that exploits evolutionary co-occurrence patterns, that is, the presence or absence of a protein across genomes, to identify functionally related proteins. Identifying these phylogenetic “profiles” allows us to infer the function of uncharacterized proteins from that of proteins already annotated.
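To make the idea of a phylogenetic profile concrete, here is a minimal sketch in Python. It assumes each profile is a binary presence/absence vector over the same ordered set of genomes and compares profiles with the Jaccard index. The protein names, the toy vectors, and the choice of Jaccard similarity are illustrative assumptions, not the data or metric used in this project.

```python
from typing import Dict, List, Tuple

# Hypothetical toy profiles: 1 = a homolog is present in that genome, 0 = absent.
# Every vector covers the same six (imaginary) genomes in the same order.
profiles: Dict[str, List[int]] = {
    "protA": [1, 1, 0, 1, 0, 1],
    "protB": [1, 1, 0, 1, 0, 0],
    "protC": [0, 0, 1, 0, 1, 0],
}


def jaccard(p: List[int], q: List[int]) -> float:
    """Shared presences divided by genomes where either protein is present."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def most_similar(query: str, table: Dict[str, List[int]]) -> Tuple[str, float]:
    """Return the other protein whose profile is closest to the query's."""
    scores = {
        name: jaccard(table[query], vec)
        for name, vec in table.items()
        if name != query
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    print(most_similar("protA", profiles))
```

In this toy table, `protA` and `protB` co-occur in most genomes, so an annotation known for one would be a candidate annotation for the other; `protC` has a complementary pattern and would not be.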
However, these patterns have traditionally been correlated using clustering methods. In this project, we propose combining Phylogenetic Profiling analysis with the learning ability of Machine Learning methods in order to predict protein function in the form of GO terms. A pipeline was developed to test different models and filters in a High-Performance Computing environment. Among the six supervised classification algorithms evaluated, Random Forest provided the most accurate predictions, with a mean F1-score of 0.61. The same data were used to predict potential human disease genes, and the results obtained were very similar. Although this performance is still far from the needs of the scientific community, this study validates Phylogenetic Profiling as a method that can be combined with other approaches to further improve the performance of protein function assignment.
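For readers unfamiliar with the metric, the sketch below shows one way a mean F1-score could be computed: F1 is the harmonic mean of precision and recall for a binary label, and the mean is taken here as an unweighted average over GO terms. The averaging scheme, the GO identifiers, and the label vectors are assumptions made for illustration; the text above does not specify how the reported 0.61 was averaged.

```python
from typing import Dict, List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for one binary label: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(per_term: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted (macro) average of F1 over all GO terms."""
    scores = [f1_score(truth, pred) for truth, pred in per_term.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels: for each GO term, (true annotations, predictions)
    # over the same four proteins.
    example = {
        "GO:0000001": ([1, 1, 0, 0], [1, 0, 0, 0]),
        "GO:0000002": ([1, 0, 1, 0], [1, 0, 1, 1]),
    }
    print(round(mean_f1(example), 2))
```

A macro average treats rare and common GO terms equally; a weighted or micro average would give a different figure on the same predictions.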