Project BUGS: AI-Powered Biodiversity Discovery from eDNA

A project for the Smart India Hackathon 2025.

Problem Statement ID: 25042
Problem Statement Title: Identifying Taxonomy and Assessing Biodiversity from eDNA Datasets
Theme: Software

The Problem

Traditional methods for analyzing environmental DNA (eDNA) face significant hurdles that limit the scope and speed of biodiversity discovery. Key challenges include:

Incomplete Databases: A large share of deep-sea eDNA reads remains unmatched because reference databases are incomplete.
Data Loss: Unknown or low-confidence reads are often discarded or vaguely labeled, resulting in a loss of potentially novel discoveries.
Computational Bottlenecks: Alignment-based tools like BLAST are too slow and computationally expensive for the massive datasets generated by modern expeditions.
Limited Ecological Insight: Analysis often stops at taxonomic lists, failing to reveal deeper ecosystem structures, gradients, or patterns.
Slow Validation: Confirming computationally predicted novel organisms requires a slow and costly biological verification process, creating a significant bottleneck.

Our Solution

Project BUGS introduces a Hybrid AI Pipeline that revolutionizes eDNA analysis by combining supervised classification with unsupervised discovery. Our system accurately identifies known species while simultaneously discovering and clustering potential novel taxa from unknown sequences.

Instead of discarding unidentifiable reads, our "Discovery-First" design treats them as opportunities for discovery, enabling a more complete and insightful picture of biodiversity.

Key Features & Innovation

Genomic Transformer (DNABERT-2): We use DNABERT-2 to create rich numerical representations (embeddings) of DNA sequences, allowing for pattern recognition that is independent of reference databases.
Hybrid AI Pipeline: The system combines a supervised stream for rapid classification of known taxa with an unsupervised stream for the discovery of novel organisms.
Unsupervised Novelty Engine: Low-confidence and unknown reads are clustered using a powerful combination of UMAP for dimensionality reduction and HDBSCAN for density-based clustering to identify potential new species.
Ecology-Aware Outputs: The platform moves beyond simple taxonomic lists to provide valuable ecological insights, including biodiversity ordinations, gradients, and novelty alerts.
Scalable & Deployable: The architecture supports lightweight analysis for onboard ship results and deeper, full-scale discovery in the cloud.
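
The "Ecology-Aware Outputs" feature can be illustrated with two standard community-ecology measures. The sketch below is a minimal, pure-Python example and not the project's implementation: the source does not say which metrics the platform computes, so Shannon diversity and Bray-Curtis dissimilarity are assumed here as common inputs to biodiversity summaries and ordinations. The function names and the per-site counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: 0.0 for identical samples, 1.0 for no shared taxa."""
    taxa = set(sample_a) | set(sample_b)
    diff = sum(abs(sample_a.get(t, 0) - sample_b.get(t, 0)) for t in taxa)
    total = sum(sample_a.values()) + sum(sample_b.values())
    return diff / total if total else 0.0


# Hypothetical per-site read counts (assumption, not project data).
# "cluster_7" stands in for an unlabeled cluster from the discovery stream.
site_shallow = {"Copepoda": 40, "Radiolaria": 40, "cluster_7": 20}
site_deep = {"Radiolaria": 10, "cluster_7": 90}

print("Shannon (shallow):", shannon_index(site_shallow))
print("Shannon (deep):", shannon_index(site_deep))
print("Bray-Curtis:", bray_curtis(site_shallow, site_deep))
```

Counting unlabeled clusters alongside named taxa, as in this sketch, is one way to keep unidentified reads in the diversity picture instead of discarding them.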

Technical Architecture

Our model is built on a three-phase architecture:

Phase 1: Feature Extraction: A raw DNA sequence is tokenized and fed into the DNABERT-2 model. This phase outputs a high-dimensional "Feature Matrix X1," which is a rich numerical representation of the sequence.
Phase 2: Supervised Classification: The feature matrix is passed to a neural network classifier. This stream outputs a predicted taxonomic label and a confidence level for known sequences using a softmax output layer.
Phase 3: Unsupervised Discovery: The same feature matrix is also sent to an unsupervised stream. UMAP reduces its dimensionality, and HDBSCAN clusters similar unknown sequences into potential novel taxa.
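
The hand-off between the supervised and unsupervised streams can be sketched as follows. This is a minimal, pure-Python illustration and not the project's code: it assumes the classifier emits raw logits per taxon, and that reads whose top softmax probability falls below a threshold are flagged for the discovery stream, in line with the Key Features note that low-confidence and unknown reads are clustered. The threshold of 0.8, the function names, and the example labels are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_read(logits, labels, threshold=0.8):
    """Return the top label if confident enough, else flag the read for discovery.

    `threshold` is an assumed value; the source does not state one.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence >= threshold:
        return {"stream": "supervised", "label": labels[best], "confidence": confidence}
    return {"stream": "discovery", "label": None, "confidence": confidence}


# Hypothetical taxon labels and logits for two reads.
labels = ["Copepoda", "Radiolaria", "Foraminifera"]
print(route_read([5.0, 0.0, 0.0], labels))  # one class dominates
print(route_read([1.0, 1.0, 1.0], labels))  # no class stands out
```

In this sketch, reads returned with `"stream": "discovery"` are the ones that would be passed on to UMAP and HDBSCAN rather than dropped.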

Technology Stack

Core Language & Data Handling
- Python: The primary language for the project.
- Biopython: Used for parsing and handling FASTA files.
- Pandas: For managing datasets of embeddings and taxonomic information.
Machine Learning & GPU Acceleration
- Transformers (Hugging Face, PyTorch): To implement the DNABERT-2 model for generating sequence embeddings.
- NVIDIA RAPIDS (cuML, cuDF): For GPU-accelerated DataFrames and machine learning.
- HDBSCAN: The core algorithm for density-based clustering to detect novel taxa.
Prototyping & Visualization
- UMAP (cuML): For fast, GPU-accelerated dimensionality reduction.
- Matplotlib & Seaborn: To create high-quality visualizations of biodiversity maps and clusters.
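
HDBSCAN assigns each input point an integer cluster label and marks points it treats as noise with -1. The sketch below shows one way such labels could be turned into groups of reads for review as candidate novel taxa. It assumes the labels have already been computed (for example by HDBSCAN on the UMAP-reduced embeddings); the function name, read IDs, and label values are hypothetical.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN's convention for points not assigned to any cluster


def summarize_clusters(read_ids, labels):
    """Group read IDs by cluster label; return (clusters, noise_reads)."""
    if len(read_ids) != len(labels):
        raise ValueError("read_ids and labels must have the same length")
    clusters = defaultdict(list)
    noise = []
    for read_id, label in zip(read_ids, labels):
        if label == NOISE_LABEL:
            noise.append(read_id)
        else:
            clusters[label].append(read_id)
    return dict(clusters), noise


# Hypothetical read IDs and cluster labels (assumption, not project output).
read_ids = ["r1", "r2", "r3", "r4", "r5"]
labels = [0, 0, -1, 1, 1]

clusters, noise = summarize_clusters(read_ids, labels)
for cluster_id, members in sorted(clusters.items()):
    print(f"candidate cluster {cluster_id}: {len(members)} reads {members}")
print("unclustered reads:", noise)
```

Keeping the noise reads in a separate list, rather than discarding them, leaves them available for re-clustering as more samples arrive.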

Team BUGS

Shanvi Verma
Ishita Singh
Abhijit Prasad
Arindol Sarkar
Saloni Kushwaha
Atul Gadkoti
