Ever wondered if you can teach a computer to read and understand DNA? This project does just that. It uses a state-of-the-art Natural Language Processing model, DNABERT, to classify human DNA sequences and identify functional "promoter" regions, which are critical for gene activation.
This isn't just about getting a prediction; it's about using the model's internal "attention" mechanism to understand why it made its decision, highlighting the key DNA motifs that are biologically significant.
- Promoter Classification: A binary classifier that distinguishes between promoter and non-promoter DNA sequences.
- State-of-the-Art Model: Leverages a pre-trained DNABERT model, which understands the "language" of genomics.
- Fine-Tuning: Built with PyTorch and the Hugging Face `transformers` library for efficient fine-tuning.
- Interpretable AI: Includes the ability to visualize model attention, turning the "black box" into a tool for scientific insight (see the sketch after this list).
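On that last point, here is a minimal sketch of pulling attention weights out of a BERT-style model with the `transformers` library. The checkpoint name and the toy 6-mer input are assumptions for illustration; any DNABERT-compatible checkpoint can be loaded the same way:

```python
# Minimal sketch: extracting attention weights from a BERT-style model.
# The checkpoint name "zhihan1996/DNA_bert_6" is an assumption; swap in
# whichever DNABERT checkpoint you are actually using.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "zhihan1996/DNA_bert_6"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

# DNABERT-style input: a sequence pre-split into space-separated 6-mers.
inputs = tokenizer("GATTAC ATTACA TTACAG", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each shaped
# (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1]
attention_map = last_layer.mean(dim=1).squeeze(0)  # average over heads
print(attention_map.shape)  # (seq_len, seq_len) map, ready to visualize
```

A heatmap of `attention_map` (e.g. with matplotlib) then highlights which 6-mers the model attends to when making its call.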
The model treats DNA as a language. A raw DNA sequence like GATTACA... is broken down into overlapping "words" of 6 letters (6-mers) by a specialized tokenizer. The pre-trained DNABERT model, which has already learned the fundamental grammar of DNA from massive genomic databases, is then fine-tuned on our specific task of promoter identification. This approach is highly effective and requires significantly less training time than starting from scratch.
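As a quick illustration of the k-mer idea (illustrative only; the actual DNABERT tokenizer also maps each 6-mer to a vocabulary ID and adds special tokens):

```python
def seq_to_kmers(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into overlapping k-mer 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(seq_to_kmers("GATTACAGAT"))
# ['GATTAC', 'ATTACA', 'TTACAG', 'TACAGA', 'ACAGAT']
```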
Follow these steps to get the project up and running on your local machine.
- Python 3.8+
- PyTorch
- A CUDA-enabled GPU is highly recommended for training.
- Clone the repository:

  ```bash
  git clone https://github.com/harshrajhrj/genomic-sequence-classification-with-an-nlp-transformer.git
  cd genomic-sequence-classification-with-an-nlp-transformer
  ```
- Install dependencies (a virtual environment is recommended):

  ```bash
  pip install -r requirements.txt
  ```
- Download the dataset: This project uses the "Human Gene Promoter and Non-Promoter Sequences" dataset.
  - Download it from Kaggle.
  - Download the files `PromoterSequence.txt` and `NonPromoterSequence.txt`.
  - Create a `dataset` folder under the root directory and place both files in it.
  - Run the following commands to convert the raw text files into CSV format:

    ```bash
    cd utils
    python util.py
    python merge_csv.py
    ```
The main script for training the model is `train.py`, located in the `model` directory.
To start fine-tuning the DNABERT model on the promoter dataset, run the following command in your terminal:
```bash
cd model
python train.py --epochs 5 --batch_size 32 --learning_rate 2e-5
```

The script handles data preprocessing, training, and evaluation. It prints the progress for each epoch and saves the best-performing model weights as `dnabert_promoter_best_model.bin`.
Once training is complete, you can use the saved model to make predictions on new DNA sequences. Here's a quick example snippet (a sketch of the `predict_sequence` helper it relies on follows the example output):
```python
# Example of a real human promoter sequence (contains a TATA-box motif)
promoter_example = "cgcgcccgcgccgcatatacgcgtatatacgcgtatacgcgtatacgcgtacgcgta"

# Example of a random, non-promoter-like sequence
non_promoter_example = "atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatc"

pred_label, confidence = predict_sequence(promoter_example)
print(f"Sequence: {promoter_example[:30]}...")
print(f"Predicted Label: {pred_label}")
print(f"Confidence: {confidence:.4f}\n")

pred_label, confidence = predict_sequence(non_promoter_example)
print(f"Sequence: {non_promoter_example[:30]}...")
print(f"Predicted Label: {pred_label}")
print(f"Confidence: {confidence:.4f}")
```

Expected output:

```
Sequence: cgcgcccgcgccgcatatacgcgtatatac...
Predicted Label: Non-Promoter
Confidence: 0.8672

Sequence: atcgatcgatcgatcgatcgatcgatcgat...
Predicted Label: Non-Promoter
Confidence: 0.8747
```
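A minimal, hypothetical sketch of the `predict_sequence` helper is below. The base checkpoint name, the label order, and the assumption that `dnabert_promoter_best_model.bin` is a plain PyTorch `state_dict` are all guesses about this repo's conventions, so adjust them to match your own code:

```python
# Hypothetical sketch of the predict_sequence helper used above.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BASE = "zhihan1996/DNA_bert_6"  # assumed DNABERT base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)
model.load_state_dict(torch.load("dnabert_promoter_best_model.bin", map_location="cpu"))
model.eval()

LABELS = {0: "Non-Promoter", 1: "Promoter"}  # assumed label order

def predict_sequence(sequence: str, k: int = 6):
    """Classify a raw DNA string and return (label, confidence)."""
    seq = sequence.upper()
    # DNABERT expects space-separated k-mers rather than raw bases.
    kmers = " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))
    inputs = tokenizer(kmers, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])
```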
This project is a great foundation. Here are a few ideas for extending it:
- Build a simple web interface with Streamlit or Flask to make predictions interactively (a minimal sketch follows this list).
- Experiment with other pre-trained genomic models.
- Expand the classifier to handle multi-class problems (e.g., classifying different types of genomic elements).
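For the web-interface idea, a minimal Streamlit sketch could be as small as the following (the `predict` module and its `predict_sequence` import are hypothetical, echoing the helper sketched above):

```python
# Hypothetical Streamlit front-end for the classifier.
import streamlit as st

from predict import predict_sequence  # hypothetical module

st.title("DNA Promoter Classifier")
sequence = st.text_area("Paste a DNA sequence (A/C/G/T):")

if st.button("Classify") and sequence:
    label, confidence = predict_sequence(sequence.strip())
    st.write(f"**Prediction:** {label}")
    st.write(f"**Confidence:** {confidence:.4f}")
```

Save it as `app.py` and launch it with `streamlit run app.py`.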