The goal of the CAFA (Critical Assessment of Functional Annotation) competition is to predict the functions of a set of proteins. We developed a model trained on the proteins' amino-acid sequences that predicts protein function as a multi-label classification problem, with Gene Ontology (GO) terms as the labels. Better function prediction helps researchers understand how cells, tissues, and organs work, and may also aid the development of new drugs and therapies for various diseases.
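To make the multi-label framing concrete, here is a minimal sketch of how GO annotations can be turned into multi-hot label vectors, one column per GO term. The protein IDs (`P1`, `P2`, `P3`) and the tiny annotation table are made up for illustration; the real label space in the competition is far larger, and this is not necessarily how our pipeline builds its labels.

```python
# Hypothetical annotations: protein ID -> set of GO term IDs (illustrative only).
annotations = {
    "P1": {"GO:0003674", "GO:0005515"},
    "P2": {"GO:0005515"},
    "P3": {"GO:0008150", "GO:0003674"},
}

# Fix a column order for the label space by sorting all observed GO terms.
go_terms = sorted({term for terms in annotations.values() for term in terms})
term_index = {term: i for i, term in enumerate(go_terms)}


def encode(terms):
    """Return a multi-hot vector with a 1 at each annotated GO term."""
    vec = [0] * len(go_terms)
    for term in terms:
        vec[term_index[term]] = 1
    return vec


# One label vector per protein; a protein may have several 1s at once.
labels = {pid: encode(terms) for pid, terms in annotations.items()}
```

Because a protein can carry several GO terms at once, each column is treated as its own yes/no prediction rather than as one of several mutually exclusive classes.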
Read more about this competition here: Kaggle link
We used the following dataset: link
- BLAST
- K-Nearest Neighbours
- Random Forest
- XGBoost
- Dimension Reduction techniques (PCA, autoencoder)
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN, LSTM)
- Multilayer Perceptrons (with embeddings from pretrained transformers, like T5, ProtBERT, ESM2) [BEST Performing]
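The best-performing approach in the list above, a multilayer perceptron on top of pretrained embeddings, can be sketched roughly as follows. This is a toy NumPy version under stated assumptions, not our actual training code: the embeddings `X` and labels `Y` are random stand-ins (real inputs would be precomputed T5, ProtBERT, or ESM2 vectors and real GO annotations), and the sizes, learning rate, and single hidden layer are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64 proteins, 32-dim embeddings, 16 hidden units, 5 GO terms.
n_proteins, embed_dim, hidden_dim, n_terms = 64, 32, 16, 5
X = rng.normal(size=(n_proteins, embed_dim))                  # stand-in embeddings
Y = (rng.random((n_proteins, n_terms)) < 0.3).astype(float)   # stand-in multi-hot labels

W1 = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, n_terms))
b2 = np.zeros(n_terms)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(inputs):
    hidden = np.maximum(0.0, inputs @ W1 + b1)   # ReLU hidden layer
    probs = sigmoid(hidden @ W2 + b2)            # independent probability per GO term
    return hidden, probs


def bce(probs, targets):
    """Mean binary cross-entropy over all (protein, GO term) pairs."""
    eps = 1e-9
    return float(-np.mean(targets * np.log(probs + eps)
                          + (1 - targets) * np.log(1 - probs + eps)))


_, P0 = forward(X)
initial_loss = bce(P0, Y)

lr = 0.5
for _ in range(200):
    H, P = forward(X)
    dZ2 = (P - Y) / Y.size          # gradient of mean BCE w.r.t. output logits
    dW2 = H.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (H > 0)    # backpropagate through ReLU
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

_, P = forward(X)
final_loss = bce(P, Y)
predicted = (P >= 0.5).astype(int)   # threshold each GO term independently
```

The sigmoid output with binary cross-entropy is what makes this multi-label: each GO term gets its own probability, so a protein can be assigned any number of terms. The 0.5 threshold is an arbitrary default here.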
Kaushik Raj Nadar (Mentor) - Report