Genomic Sequence Classifier with DNABERT

Ever wondered if you can teach a computer to read and understand DNA? This project does just that. It uses a state-of-the-art Natural Language Processing model, DNABERT, to classify human DNA sequences and identify functional "promoter" regions, which are critical for gene activation.

This isn't just about getting a prediction; it's about using the model's internal "attention" mechanism to understand why it made its decision, highlighting the key DNA motifs that are biologically significant.

Key Features

  • Promoter Classification: A binary classifier that distinguishes between promoter and non-promoter DNA sequences.
  • State-of-the-Art Model: Leverages a pre-trained DNABERT model, which understands the "language" of genomics.
  • Fine-Tuning: Built with PyTorch and the Hugging Face transformers library for efficient fine-tuning.
  • Interpretable AI: Includes the ability to visualize model attention, turning the "black box" into a tool for scientific insight (a sketch of the idea follows this list).
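
For the interpretability piece, here is a minimal sketch of one way to pull attention scores out of a fine-tuned BERT-style classifier. This is an illustration of the idea, not the repo's actual visualization code; it assumes model and tokenizer objects like those in the training sketch further below.

import torch

# Sketch only: compute a rough per-k-mer importance score from the last
# attention layer. `model` and `tokenizer` are assumed to be loaded
# elsewhere (see the training sketch in the Usage section).
def attention_scores(sequence, model, tokenizer, k=6):
    kmers = " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    inputs = tokenizer(kmers, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # Average the last layer over all heads, then read off the attention
    # flowing from the [CLS] token to every k-mer position.
    last_layer = outputs.attentions[-1].mean(dim=1)  # (batch, seq, seq)
    return last_layer[0, 0, :]

High-scoring positions can then be plotted along the sequence to highlight candidate motifs such as the TATA box.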

How It Works

The model treats DNA as a language. A raw DNA sequence like GATTACA... is broken down into overlapping "words" of 6 letters (6-mers) by a specialized tokenizer. The pre-trained DNABERT model, which has already learned the fundamental grammar of DNA from massive genomic databases, is then fine-tuned on our specific task of promoter identification. This approach is highly effective and requires significantly less training time than starting from scratch.
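
To make the k-mer step concrete, here is a tiny sketch of the overlapping 6-mer split (the real DNABERT tokenizer additionally maps each 6-mer to a vocabulary ID):

def seq_to_kmers(sequence, k=6):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(seq_to_kmers("GATTACAG"))
# ['GATTAC', 'ATTACA', 'TTACAG']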


Getting Started

Follow these steps to get the project up and running on your local machine.

Prerequisites

  • Python 3.8+
  • PyTorch
  • A CUDA-enabled GPU is highly recommended for training.

Installation

  1. Clone the repository:

    git clone https://github.com/harshrajhrj/genomic-sequence-classification-with-an-nlp-transformer.git
    cd genomic-sequence-classification-with-an-nlp-transformer
  2. Install dependencies: It's recommended to use a virtual environment.

    pip install -r requirements.txt
  3. Download the dataset: This project uses the "Human Gene Promoter and Non-Promoter Sequences" dataset.

    • Download it from Kaggle.
    • Download the files:
      • NonPromoterSequence.txt
      • PromoterSequence.txt
    • Create a dataset folder under the root directory and place both files in it.
    • Run the following commands to convert the raw text files into CSV format (a sketch of what these scripts do follows this list):
      cd utils
      python util.py
      python merge_csv.py
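
The two helper scripts live in utils/. For orientation only, here is a minimal sketch of what the conversion amounts to, assuming one sequence per line in each .txt file; the actual util.py and merge_csv.py may differ in details.

import pandas as pd

# Sketch only: turn each raw .txt file (assumed one sequence per line)
# into a labeled CSV, then merge and shuffle the two CSVs.
def txt_to_csv(txt_path, csv_path, label):
    with open(txt_path) as f:
        sequences = [line.strip() for line in f if line.strip()]
    pd.DataFrame({"sequence": sequences, "label": label}).to_csv(csv_path, index=False)

txt_to_csv("dataset/PromoterSequence.txt", "dataset/promoter.csv", 1)
txt_to_csv("dataset/NonPromoterSequence.txt", "dataset/non_promoter.csv", 0)

merged = pd.concat([pd.read_csv("dataset/promoter.csv"),
                    pd.read_csv("dataset/non_promoter.csv")])
merged.sample(frac=1, random_state=42).to_csv("dataset/merged.csv", index=False)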

Usage

The main script for training the model is train.py.

Training the Model

To start fine-tuning the DNABERT model on the promoter dataset, run the following command in your terminal:

cd model
python train.py --epochs 5 --batch_size 32 --learning_rate 2e-5

The script handles data preprocessing, training, and evaluation. It prints progress for each epoch and saves the best-performing model weights as dnabert_promoter_best_model.bin.
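
train.py encapsulates the whole pipeline. For orientation, here is a minimal fine-tuning sketch built on the Hugging Face Trainer; the checkpoint name zhihan1996/DNA_bert_6, the CSV path, and the column names are assumptions, not necessarily what train.py uses.

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "zhihan1996/DNA_bert_6"  # assumed 6-mer DNABERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

class PromoterDataset(Dataset):
    def __init__(self, csv_path, k=6):
        df = pd.read_csv(csv_path)  # assumed columns: sequence, label
        # DNABERT's tokenizer expects space-separated k-mers.
        kmers = [" ".join(s[i:i + k] for i in range(len(s) - k + 1))
                 for s in df["sequence"]]
        self.enc = tokenizer(kmers, truncation=True, padding=True, max_length=512)
        self.labels = df["label"].tolist()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="out", num_train_epochs=5,
                         per_device_train_batch_size=32, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=PromoterDataset("dataset/merged.csv")).train()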

Making Predictions with the Trained Model

Once training is complete, you can easily use the saved model to make predictions on new DNA sequences. Here's a quick example snippet:

Prediction

# Example of a real human promoter sequence (contains TATA-box motifs)
promoter_example = "cgcgcccgcgccgcatatacgcgtatatacgcgtatacgcgtatacgcgtacgcgta"

# Example of a random, non-promoter-like sequence
non_promoter_example = "atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatc"


pred_label, confidence = predict_sequence(promoter_example)
print(f"Sequence: {promoter_example[:30]}...")
print(f"Predicted Label: {pred_label}")
print(f"Confidence: {confidence:.4f}\n")


pred_label, confidence = predict_sequence(non_promoter_example)
print(f"Sequence: {non_promoter_example[:30]}...")
print(f"Predicted Label: {pred_label}")
print(f"Confidence: {confidence:.4f}")

Output

Sequence: cgcgcccgcgccgcatatacgcgtatatac...
Predicted Label: Non-Promoter
Confidence: 0.8672

Sequence: atcgatcgatcgatcgatcgatcgatcgat...
Predicted Label: Non-Promoter
Confidence: 0.8747
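
The predict_sequence helper comes from the repo's own prediction code. As a rough sketch of what such a helper could look like, assuming the model and tokenizer from the training sketch above and the checkpoint file saved by train.py (the class-index mapping is an assumption):

import torch

# Restore the fine-tuned weights; `model` and `tokenizer` are assumed
# to be loaded as in the training sketch above.
model.load_state_dict(torch.load("dnabert_promoter_best_model.bin"))
model.eval()

LABELS = {0: "Non-Promoter", 1: "Promoter"}  # assumed class order

def predict_sequence(sequence, k=6):
    """Return a (label, confidence) pair for a raw DNA string."""
    sequence = sequence.upper()  # DNABERT vocabularies use uppercase k-mers
    kmers = " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    inputs = tokenizer(kmers, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    conf, idx = probs.max(dim=-1)
    return LABELS[int(idx)], float(conf)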

Future Work

This project is a great foundation. Here are a few ideas for extending it:

  • Build a simple web interface with Streamlit or Flask to make predictions interactively.
  • Experiment with other pre-trained genomic models.
  • Expand the classifier to handle multi-class problems (e.g., classifying different types of genomic elements).
