A project for the Smart India Hackathon 2025.
- Problem Statement ID: 25042
- Problem Statement Title: Identifying Taxonomy and Assessing Biodiversity from eDNA Datasets
- Theme: Software
Traditional methods for analyzing environmental DNA (eDNA) face significant hurdles that limit the scope and speed of biodiversity discovery. Key challenges include:
- Incomplete Databases: A large share of deep-sea eDNA reads remains unmatched because reference databases are incomplete.
- Data Loss: Unknown or low-confidence reads are often discarded or vaguely labeled, resulting in a loss of potentially novel discoveries.
- Computational Bottlenecks: Alignment-based tools like BLAST are too slow and computationally expensive for the massive datasets generated by modern expeditions.
- Limited Ecological Insight: Analysis often stops at taxonomic lists, failing to reveal deeper ecosystem structures, gradients, or patterns.
- Slow Validation: Confirming computationally predicted novel organisms requires a slow and costly biological verification process, creating a significant bottleneck.
Project BUGS introduces a Hybrid AI Pipeline for eDNA analysis that combines supervised classification with unsupervised discovery. The system identifies known species while simultaneously discovering and clustering potential novel taxa from unknown sequences.
Instead of discarding unidentifiable reads, our "Discovery-First" design treats them as candidate novel taxa, giving a more complete picture of biodiversity.
- Genomic Transformer (DNABERT-2): We use DNABERT-2 to create rich numerical representations (embeddings) of DNA sequences, allowing for pattern recognition that is independent of reference databases.
- Hybrid AI Pipeline: The system combines a supervised stream for rapid classification of known taxa with an unsupervised stream for the discovery of novel organisms.
- Unsupervised Novelty Engine: Low-confidence and unknown reads are clustered using UMAP for dimensionality reduction followed by HDBSCAN for density-based clustering, to identify potential new species.
- Ecology-Aware Outputs: The platform moves beyond simple taxonomic lists to provide valuable ecological insights, including biodiversity ordinations, gradients, and novelty alerts.
- Scalable & Deployable: The architecture supports lightweight analysis for onboard ship results and deeper, full-scale discovery in the cloud.
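As an illustration of the "Ecology-Aware Outputs" item above, the sketch below computes two standard community-ecology measures from per-sample label counts: the Shannon diversity index for a single sample, and the Bray-Curtis dissimilarity between two samples (pairwise dissimilarities of this kind are a common input to ordinations). The choice of these two measures, the function names, and the sample data are assumptions made for illustration; the project does not state which measures it uses.

```python
import math
from collections import Counter


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over labels with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0.0 for identical samples, 1.0 for no shared labels."""
    labels = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in labels)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


# Hypothetical read counts per label; "novel_cluster_*" stands in for
# clusters produced by the unsupervised stream.
sample_a = Counter({"Taxon_A": 40, "Taxon_B": 40, "novel_cluster_0": 20})
sample_b = Counter({"Taxon_A": 10, "novel_cluster_0": 20, "novel_cluster_1": 70})

print(f"Shannon (sample A): {shannon_index(sample_a):.3f}")
print(f"Bray-Curtis (A vs B): {bray_curtis(sample_a, sample_b):.3f}")
```

Counting novel clusters alongside named taxa keeps unidentified reads in the diversity estimate rather than dropping them, in line with the "Discovery-First" design.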
Our model is built on a three-phase architecture:
- Phase 1: Feature Extraction: A raw DNA sequence is tokenized and fed into the DNABERT-2 model. This phase outputs a high-dimensional "Feature Matrix X1," a numerical representation of the sequence.
- Phase 2: Supervised Classification: The feature matrix is passed to a neural network classifier. This stream outputs a predicted taxonomic label and a confidence level for known sequences via a softmax layer.
- Phase 3: Unsupervised Discovery: The same feature matrix is also sent to an unsupervised stream. UMAP reduces its dimensionality, and HDBSCAN clusters similar unknown sequences into potential novel taxa.
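The hand-off between Phase 2 and Phase 3 can be sketched as a confidence threshold on the softmax output: reads whose top class probability is high keep their taxonomic label, and the rest are flagged for the discovery stream. This is a minimal sketch, not the project's implementation; the `route_read` helper, the example labels, and the `0.8` threshold are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw classifier scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_read(logits, labels, threshold=0.8):
    """Return (stream, label, confidence) for one read.

    `threshold` is a hypothetical cut-off; a real value would need tuning
    on validation data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence >= threshold:
        return ("classified", labels[best], confidence)
    return ("discovery", None, confidence)


labels = ["Taxon_A", "Taxon_B", "Taxon_C"]  # hypothetical known taxa

print(route_read([5.0, 0.1, 0.2], labels))  # one class dominates
print(route_read([1.0, 0.9, 1.1], labels))  # near-uniform scores
```

Reads returned with the `"discovery"` tag would then have their embeddings passed to the UMAP and HDBSCAN stage.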
- Core Language & Data Handling
  - Python: The primary language for the project.
  - Biopython: Used for parsing and handling FASTA files.
  - Pandas: For managing datasets of embeddings and taxonomic information.
- Machine Learning & GPU Acceleration
  - Transformers (Hugging Face, PyTorch): To implement the DNABERT-2 model for generating sequence embeddings.
  - NVIDIA RAPIDS (cuML, cuDF): For GPU-accelerated DataFrames and machine learning.
  - HDBSCAN: The core algorithm for density-based clustering to detect novel taxa.
- Prototyping & Visualization
  - UMAP (cuML): For fast, GPU-accelerated dimensionality reduction.
  - Matplotlib & Seaborn: To create high-quality visualizations of biodiversity maps and clusters.
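To show the kind of input the pipeline starts from, here is a dependency-free sketch of FASTA parsing. The project lists Biopython for this step (its `SeqIO.parse(handle, "fasta")` covers the same ground with richer record objects); the `read_fasta` helper and the sample records below are hypothetical stand-ins so the example runs without extra packages.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id = header[0] if header else ""
            chunks = []
        elif record_id is not None:
            # Sequence lines may wrap; join them and normalize case.
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


# Hypothetical two-read FASTA input
fasta_text = """>read_001 sample=deep_sea
ACGTACGT
ttgaccga
>read_002
GGCATTAC
"""

records = list(read_fasta(StringIO(fasta_text)))
for rid, seq in records:
    print(rid, len(seq), seq)
```

Each `(record_id, sequence)` pair is what would be tokenized and embedded in Phase 1.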
- Shanvi Verma
- Ishita Singh
- Abhijit Prasad
- Arindol Sarkar
- Saloni Kushwaha
- Atul Gadkoti