Fine-tuning DNA language models (Nucleotide Transformers) for genomic prediction tasks including promoter/enhancer identification, variant effect prediction, and more.
This project applies transformer-based language models pre-trained on genomic sequences to various downstream prediction tasks, leveraging DNA sequence representations for biological insights.
The downstream tasks include:
- Identifying regulatory DNA elements (promoters and enhancers) from sequence context.
- Predicting the functional impact of genomic variants on:
  - Gene expression
  - Protein function
  - Disease association
- Associating genetic variants with traits and diseases.
- Identifying DNA motifs and transcription factor binding sites.
- Predicting histone modifications and chromatin states.
- Modeling chromatin accessibility and 3D structure from sequence alone.
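One common way to turn sequence embeddings into a variant effect score is to embed the reference and alternate sequences and compare the two vectors. The sketch below assumes that approach; it is not taken from this project's scripts. `apply_snv` and `cosine_distance` are hypothetical helper names, and the toy vectors stand in for pooled model embeddings.

```python
import math


def apply_snv(reference: str, position: int, alt: str) -> str:
    """Return the sequence with the base at 0-based `position` replaced by `alt`."""
    if not 0 <= position < len(reference):
        raise IndexError("variant position is outside the sequence")
    if len(alt) != 1 or alt.upper() not in "ACGT":
        raise ValueError("alt must be a single A/C/G/T base")
    return reference[:position] + alt.upper() + reference[position + 1:]


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; 0.0 means the embeddings point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return 1.0 - dot / (norm_u * norm_v)


# Build the alternate sequence for a hypothetical A>G substitution at index 4.
reference_seq = "ACGTACGTACGTACGT"
alternate_seq = apply_snv(reference_seq, position=4, alt="G")

# Toy embeddings (assumption): in practice these would come from the model,
# e.g. by averaging the last hidden state over sequence positions.
ref_embedding = [1.0, 0.0, 0.0]
alt_embedding = [0.0, 1.0, 0.0]
score = cosine_distance(ref_embedding, alt_embedding)
```

A larger distance suggests the variant changes the learned representation more, but whether that tracks functional impact depends on the task and needs validation against labeled variants.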
- Base Model: Nucleotide Transformers (pre-trained on genomic sequences)
- Fine-tuning Approach: Task-specific adaptation with frozen/unfrozen layers
- Evaluation Metrics: AUC and F1-score for classification tasks; correlation for regression tasks
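To make the evaluation metrics concrete, here is a minimal pure-Python sketch of AUC, F1-score, and Pearson correlation. It assumes binary labels coded as 0/1 and is meant as an illustration, not as the contents of `scripts/evaluate.py`; in practice, library implementations such as scikit-learn's `roc_auc_score` and `f1_score` or SciPy's `pearsonr` are the more usual choice.

```python
import math


def roc_auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Classification example: threshold the scores at 0.5 to get hard predictions.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
predictions = [1 if s >= 0.5 else 0 for s in scores]
auc = roc_auc(labels, scores)
f1 = f1_score(labels, predictions)

# Regression example: predicted vs. measured values.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

AUC is computed from the raw scores, while F1 depends on the chosen decision threshold (0.5 here is an arbitrary assumption).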
DNA-Language-Model/
├── data/ # Genomic datasets (FASTA, BED formats)
├── models/ # Pre-trained and fine-tuned models
├── notebooks/ # Analysis notebooks
├── scripts/
│ ├── preprocess.py # Sequence preprocessing
│ ├── finetune.py # Model fine-tuning pipeline
│ ├── predict.py # Inference on new sequences
│ └── evaluate.py # Performance metrics
└── results/ # Predictions and analysis outputs
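The scripts themselves are not shown here. As a rough idea of what sequence preprocessing for FASTA input might involve, the sketch below reads records and normalizes them. `read_fasta`, `clean_sequence`, and the `max_length` default are hypothetical and not taken from `scripts/preprocess.py`; Biopython's `SeqIO.parse` is a well-tested alternative for the parsing step.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_sequence(sequence, max_length=1000):
    """Uppercase, map any non-ACGT symbol to N, and truncate to max_length bases."""
    cleaned = "".join(base if base in "ACGT" else "N" for base in sequence.upper())
    return cleaned[:max_length]


# In-memory FASTA text stands in for a file under data/ (illustrative only).
example = io.StringIO(">seq1 promoter\nacgtnacg\nTTRA\n>seq2\nGGCC\n")
records = [(name, clean_sequence(seq)) for name, seq in read_fasta(example)]
```

Mapping ambiguous IUPAC codes to `N` is one simple policy; dropping such sequences entirely is another, and the right choice depends on the task.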
conda create -n dna-lm python=3.10
conda activate dna-lm
pip install torch transformers huggingface-hub biopython pandas numpy

from transformers import AutoTokenizer, AutoModel
import torch
# Load pre-trained Nucleotide Transformer
model_name = "InstaDeep/nucleotide-transformer-v2-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Tokenize sequence
sequence = "ACGTACGTACGTACGT"
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

# Promoter prediction fine-tuning
python scripts/finetune.py \
    --model_name "InstaDeep/nucleotide-transformer-v2-500m-human-ref" \
    --task "promoter_detection" \
    --data_path "data/promoter_sequences.fasta" \
    --output_dir "models/promoter_model" \
    --epochs 5 \
    --batch_size 8

- Nucleotide Transformers: bioRxiv
- DNABERT: bioRxiv
- Enformer: Nature Methods
When adding new prediction tasks:
- Document the biological rationale
- Provide benchmark datasets
- Include baseline performance metrics
- Test on held-out validation sets
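For genomic data, one way to build a held-out validation set is to hold out whole chromosomes, so that validation sequences do not come from the same loci as training sequences. The sketch below assumes that strategy and a hypothetical record layout (dicts with a `chrom` key, as might be derived from BED intervals); neither is prescribed by this project.

```python
def split_by_chromosome(records, held_out_chroms):
    """Split records into (train, validation) by chromosome membership."""
    held_out = set(held_out_chroms)
    train = [r for r in records if r["chrom"] not in held_out]
    validation = [r for r in records if r["chrom"] in held_out]
    return train, validation


# Toy BED-like intervals with labels (illustrative values only).
records = [
    {"chrom": "chr1", "start": 100, "end": 400, "label": 1},
    {"chrom": "chr1", "start": 900, "end": 1200, "label": 0},
    {"chrom": "chr8", "start": 50, "end": 350, "label": 1},
    {"chrom": "chr9", "start": 700, "end": 1000, "label": 0},
]

# Holding out chr8 and chr9 is an arbitrary choice for this example.
train_set, validation_set = split_by_chromosome(records, ["chr8", "chr9"])
```

A random split is simpler, but neighboring or overlapping intervals can then land on both sides of the split and inflate validation scores.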
For questions about specific genomic tasks or model applications, refer to task-specific scripts.