Ever wondered if you can teach a computer to read and understand DNA? This project does just that. It uses a state-of-the-art Natural Language Processing model, DNABERT, to classify human DNA sequences and identify functional "promoter" regions, which are critical for gene activation.
This isn't just about getting a prediction; it's about using the model's internal "attention" mechanism to understand why it made its decision, highlighting the key DNA motifs that are biologically significant.
- Promoter Classification: A binary classifier that distinguishes between promoter and non-promoter DNA sequences.
- State-of-the-Art Model: Leverages a pre-trained DNABERT model, which understands the "language" of genomics.
- Fine-Tuning: Built with PyTorch and the Hugging Face `transformers` library for efficient fine-tuning.
- Interpretable AI: Includes the ability to visualize model attention, turning the "black box" into a tool for scientific insight (see the sketch after this list).
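On that last point, here is a minimal sketch of pulling attention weights out of a BERT-style model with the `transformers` library. The checkpoint name and the toy 6-mer input are assumptions for illustration; any DNABERT-compatible checkpoint can be loaded the same way:

```python
# Minimal sketch: extracting attention weights from a BERT-style model.
# The checkpoint name "zhihan1996/DNA_bert_6" is an assumption; swap in
# whichever DNABERT checkpoint you are actually using.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

# DNABERT-style input: a sequence pre-split into space-separated 6-mers.
inputs = tokenizer("GATTAC ATTACA TTACAG", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each shaped
# (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1]
attention_map = last_layer.mean(dim=1).squeeze(0)  # average over heads
print(attention_map.shape)  # (seq_len, seq_len) map, ready to visualize
```

A heatmap of `attention_map` (e.g. with matplotlib) then highlights which 6-mers the model attends to when making its call.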
The model treats DNA as a language. A raw DNA sequence like GATTACA... is broken down into overlapping "words" of 6 letters (6-mers) by a specialized tokenizer. The pre-trained DNABERT model, which has already learned the fundamental grammar of DNA from massive genomic databases, is then fine-tuned on our specific task of promoter identification. This approach is highly effective and requires significantly less training time than starting from scratch.
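As a quick illustration of the k-mer idea (illustrative only; the actual DNABERT tokenizer also maps each 6-mer to a vocabulary ID and adds special tokens):

```python
def seq_to_kmers(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into overlapping k-mer 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(seq_to_kmers("GATTACAGAT"))
# ['GATTAC', 'ATTACA', 'TTACAG', 'TACAGA', 'ACAGAT']
```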
Follow these steps to get the project up and running on your local machine.
- Python 3.8+
- PyTorch
- A CUDA-enabled GPU is highly recommended for training.
- Clone the repository:

  ```bash
  git clone https://github.com/harshrajhrj/genomic-sequence-classification-with-an-nlp-transformer.git
  cd genomic-sequence-classification-with-an-nlp-transformer
  ```
- Install dependencies (a virtual environment is recommended):

  ```bash
  pip install -r requirements.txt
  ```
- Download the dataset: This project uses the "Human Gene Promoter and Non-Promoter Sequences" dataset.
  - Download it from Kaggle.
  - Download the files `PromoterSequence.txt` and `NonPromoterSequence.txt`.
  - Create a `dataset` folder under the root directory and place both files in it.
  - Run the following commands to convert the raw text files into CSV format:

    ```bash
    cd utils
    python util.py
    python merge_csv.py
    ```
The main script for training the model is `train.py`, located in the `model` directory.
To start fine-tuning the DNABERT model on the promoter dataset, run the following command in your terminal:
```bash
cd model
python train.py --epochs 5 --batch_size 32 --learning_rate 2e-5
```

The script handles data preprocessing, training, and evaluation. It prints the progress for each epoch and saves the best-performing model weights as `dnabert_promoter_best_model.bin`.
Once training is complete, you can use the saved model to make predictions on new DNA sequences. Here's a quick example snippet (a sketch of the `predict_sequence` helper it relies on follows the example output):
```python
# Example of a real human promoter sequence (contains a TATA-box motif)
promoter_example = "cgcgcccgcgccgcatatacgcgtatatacgcgtatacgcgtatacgcgtacgcgta"

# Example of a random, non-promoter-like sequence
non_promoter_example = "atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatc"

pred_label, confidence = predict_sequence(promoter_example)
print(f"Sequence: {promoter_example[:30]}...")
print(f"Predicted Label: {pred_label}")
print(f"Confidence: {confidence:.4f}\n")

pred_label, confidence = predict_sequence(non_promoter_example)
print(f"Sequence: {non_promoter_example[:30]}...")
print(f"Predicted Label: {pred_label}")
print(f"Confidence: {confidence:.4f}")
```

Expected output:

```
Sequence: cgcgcccgcgccgcatatacgcgtatatac...
Predicted Label: Non-Promoter
Confidence: 0.8672

Sequence: atcgatcgatcgatcgatcgatcgatcgat...
Predicted Label: Non-Promoter
Confidence: 0.8747
```
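A minimal, hypothetical sketch of the `predict_sequence` helper is below. The base checkpoint name, the label order, and the assumption that `dnabert_promoter_best_model.bin` is a plain PyTorch `state_dict` are all guesses about this repo's conventions, so adjust them to match your own code:

```python
# Hypothetical sketch of the predict_sequence helper used above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BASE = "zhihan1996/DNA_bert_6"  # assumed DNABERT base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)
model.load_state_dict(torch.load("dnabert_promoter_best_model.bin", map_location="cpu"))
model.eval()

LABELS = {0: "Non-Promoter", 1: "Promoter"}  # assumed label order

def predict_sequence(sequence: str, k: int = 6):
    """Classify a raw DNA string and return (label, confidence)."""
    seq = sequence.upper()
    # DNABERT expects space-separated k-mers rather than raw bases.
    kmers = " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))
    inputs = tokenizer(kmers, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])
```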
This project is a great foundation. Here are a few ideas for extending it:
- Build a simple web interface with Streamlit or Flask to make predictions interactively (a minimal sketch follows this list).
- Experiment with other pre-trained genomic models.
- Expand the classifier to handle multi-class problems (e.g., classifying different types of genomic elements).
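For the web-interface idea, a minimal Streamlit sketch could be as small as the following (the `predict` module and its `predict_sequence` import are hypothetical, echoing the helper sketched above):

```python
# Hypothetical Streamlit front-end for the classifier.
import streamlit as st

from predict import predict_sequence  # hypothetical module

st.title("DNA Promoter Classifier")
sequence = st.text_area("Paste a DNA sequence (A/C/G/T):")

if st.button("Classify") and sequence:
    label, confidence = predict_sequence(sequence.strip())
    st.write(f"**Prediction:** {label}")
    st.write(f"**Confidence:** {confidence:.4f}")
```

Save it as `app.py` and launch it with `streamlit run app.py`.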