sys6-exe/Bluegenome

Project BUGS: AI-Powered Biodiversity Discovery from eDNA

A project for the Smart India Hackathon 2025.

  • Problem Statement ID: 25042
  • Problem Statement Title: Identifying Taxonomy and Assessing Biodiversity from eDNA Datasets
  • Theme: Software

The Problem

Traditional methods for analyzing environmental DNA (eDNA) face significant hurdles that limit the scope and speed of biodiversity discovery. Key challenges include:

  • Incomplete Databases: A large share of deep-sea eDNA reads remains unmatched because reference databases are incomplete.
  • Data Loss: Unknown or low-confidence reads are often discarded or vaguely labeled, resulting in a loss of potentially novel discoveries.
  • Computational Bottlenecks: Alignment-based tools like BLAST are too slow and computationally expensive for the massive datasets generated by modern expeditions.
  • Limited Ecological Insight: Analysis often stops at taxonomic lists, failing to reveal deeper ecosystem structures, gradients, or patterns.
  • Slow Validation: Confirming computationally predicted novel organisms requires a slow and costly biological verification process, creating a significant bottleneck.

Our Solution

Project BUGS introduces a Hybrid AI Pipeline that combines supervised classification with unsupervised discovery. The system identifies known species while simultaneously clustering unknown sequences into potential novel taxa.

Instead of discarding unidentifiable reads, our "Discovery-First" design treats them as opportunities for discovery, enabling a more complete and insightful picture of biodiversity.

Key Features & Innovation

  • Genomic Transformer (DNABERT-2): We use DNABERT-2 to create rich numerical representations (embeddings) of DNA sequences, allowing for pattern recognition that is independent of reference databases.
  • Hybrid AI Pipeline: The system combines a supervised stream for rapid classification of known taxa with an unsupervised stream for the discovery of novel organisms.
  • Unsupervised Novelty Engine: Low-confidence and unknown reads are clustered using a powerful combination of UMAP for dimensionality reduction and HDBSCAN for density-based clustering to identify potential new species.
  • Ecology-Aware Outputs: The platform moves beyond simple taxonomic lists to provide valuable ecological insights, including biodiversity ordinations, gradients, and novelty alerts.
  • Scalable & Deployable: The architecture supports lightweight analysis for onboard ship results and deeper, full-scale discovery in the cloud.
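
The Unsupervised Novelty Engine described above can be sketched roughly as follows. This is a minimal CPU illustration, not the project's code: it assumes the `umap-learn` package and scikit-learn 1.3 or newer (`sklearn.cluster.HDBSCAN`) as stand-ins for the GPU (cuML) versions named in the stack. The function names and the parameter values (`n_components=10`, `min_cluster_size=15`, cosine metric) are illustrative assumptions.

```python
from collections import Counter


def cluster_unknown_reads(embeddings, n_components=10, min_cluster_size=15):
    """Reduce embeddings with UMAP, then cluster them with HDBSCAN.

    `embeddings` is an array of shape (n_reads, n_features). Returns one
    integer label per read; HDBSCAN marks unclustered reads with -1.
    """
    # Imported lazily so summarize_clusters works without the ML dependencies.
    import umap  # provided by the umap-learn package (assumed CPU stand-in)
    from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3

    reducer = umap.UMAP(n_components=n_components, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced).tolist()


def summarize_clusters(labels):
    """Count reads per candidate cluster and the reads left as noise (-1)."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    return {"clusters": dict(counts), "noise": noise}
```

Each non-negative label is a candidate group of similar unknown reads. Reads labeled -1 did not fall into any dense region and would need separate handling.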

Technical Architecture

Our model is built on a three-phase architecture:

  • Phase 1: Feature Extraction: A raw DNA sequence is tokenized and fed into the DNABERT-2 model. This phase outputs a high-dimensional "Feature Matrix X1," which is a rich numerical representation of the sequence.
  • Phase 2: Supervised Classification: The feature matrix is passed to a neural network classifier. This stream outputs a predicted taxonomic label and a confidence level for known sequences from a softmax layer.
  • Phase 3: Unsupervised Discovery: The same feature matrix is also sent to an unsupervised stream. UMAP reduces its dimensionality, and HDBSCAN performs clustering to group similar unknown sequences into potential novel taxa.
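
To make the hand-off between Phase 2 and Phase 3 concrete, here is a small sketch of how a softmax confidence could decide which reads keep their supervised label and which are treated as candidates for discovery. It uses only the standard library. The function names, the `threshold=0.9` cutoff, and the example labels are assumptions for illustration; the project's actual classifier and cutoff are not specified here.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_read(logits, labels, threshold=0.9):
    """Return (stream, label, confidence) for one read.

    Reads whose top softmax probability reaches `threshold` keep their
    predicted label; the rest are flagged for the discovery stream.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence >= threshold:
        return ("known", labels[best], confidence)
    return ("discovery", None, confidence)


# Hypothetical taxon labels, used only for this example.
taxa = ["taxon_A", "taxon_B", "taxon_C"]
confident = route_read([8.0, 0.0, 0.0], taxa)
uncertain = route_read([1.0, 1.0, 1.0], taxa)
```

Here `confident` stays in the supervised stream with label `taxon_A`, while `uncertain` has a top probability of one third and is flagged for discovery.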

Technology Stack

  • Core Language & Data Handling
    • Python: The primary language for the project.
    • Biopython: Used for parsing and handling FASTA files.
    • Pandas: For managing datasets of embeddings and taxonomic information.
  • Machine Learning & GPU Acceleration
    • Transformers (Hugging Face, PyTorch): To implement the DNABERT-2 model for generating sequence embeddings.
    • NVIDIA RAPIDS (cuML, cuDF): For GPU-accelerated DataFrames and machine learning.
    • HDBSCAN: The core algorithm for density-based clustering to detect novel taxa.
  • Prototyping & Visualization
    • UMAP (cuML): For fast, GPU-accelerated dimensionality reduction.
    • Matplotlib & Seaborn: To create high-quality visualizations of biodiversity maps and clusters.
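
As an illustration of how the first two layers of this stack could fit together, the sketch below reads a FASTA file with Biopython and produces one mean-pooled DNABERT-2 embedding per read. It is an assumption-laden example, not the project's code: the checkpoint name `zhihan1996/DNABERT-2-117M` is the publicly released DNABERT-2 model on Hugging Face and may not be the one the team uses, mean pooling is one common choice among several, and `clean_read` and `embed_fasta` are hypothetical helper names.

```python
VALID_BASES = set("ACGTN")


def clean_read(sequence):
    """Uppercase a read and reject it if it has characters outside ACGTN."""
    seq = str(sequence).strip().upper()
    if seq and set(seq) <= VALID_BASES:
        return seq
    return None


def embed_fasta(path, model_name="zhihan1996/DNABERT-2-117M"):
    """Return (read_ids, embeddings) with one mean-pooled vector per read."""
    # Heavy dependencies are imported lazily so clean_read stays usable alone.
    import torch
    from Bio import SeqIO
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    model.eval()

    read_ids, vectors = [], []
    with torch.no_grad():
        for record in SeqIO.parse(path, "fasta"):
            seq = clean_read(record.seq)
            if seq is None:
                continue  # skip reads with unexpected characters
            input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
            hidden = model(input_ids)[0]  # shape: (1, n_tokens, hidden_size)
            vectors.append(hidden[0].mean(dim=0))  # mean-pool over tokens
            read_ids.append(record.id)
    if not vectors:
        return read_ids, torch.empty((0, 0))
    return read_ids, torch.stack(vectors)
```

The returned tensor can be converted with `.numpy()` and stored in a Pandas DataFrame indexed by `read_ids`. Processing one read at a time keeps the sketch simple; batching would be the obvious next step for large datasets.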

Team BUGS

  • Shanvi Verma
  • Ishita Singh
  • Abhijit Prasad
  • Arindol Sarkar
  • Saloni Kushwaha
  • Atul Gadkoti
