Spatial-VLM-Assistant: Metric Depth-Aware Visual Reasoning

An assistive spatial AI system that integrates a Vision-Language Model (LLaVA v1.5) with a metric depth transformer (Depth Anything V2) to provide real-world spatial awareness.

🚀 The Vision

Standard VLMs can describe "what" is in a scene but are "space-blind": they cannot reliably estimate distance. This project bridges that gap for assistive navigation (e.g., helping a user navigate a supermarket) by reporting the metric distance and orientation of objects in real time.

🛠️ Tech Stack & Methodology

  • Brain (Semantic): LLaVA v1.5 (7B) via 4-bit Quantization (GPU 0)
  • Eyes (Geometric): Depth Anything V2 - ViT-Large (GPU 1)
  • Dataset: NYU Depth V2 (Metric Validation)
  • Hardware Strategy: Dual-GPU Model Parallelism (2x RTX Super 8GB); a loading sketch follows this list
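
A minimal loading sketch of this dual-GPU split. The Hugging Face checkpoint names (llava-hf/llava-1.5-7b-hf, depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf) and the 4-bit settings are assumptions for illustration, not necessarily the exact configuration used in src/:

```python
# Sketch only: pins LLaVA (4-bit) to cuda:0 and the metric depth model to cuda:1.
# Requires transformers, bitsandbytes, and two GPUs; checkpoint names are assumptions.
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    LlavaForConditionalGeneration,
    pipeline,
)

# "Brain" (semantic): LLaVA v1.5 7B, 4-bit quantized, on GPU 0.
llava = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map={"": 0},   # keep the whole VLM on cuda:0
)
llava_processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# "Eyes" (geometric): Depth Anything V2 ViT-L (metric, indoor) on GPU 1.
depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
    device=1,             # cuda:1
)
```

Keeping the two models on separate 8 GB cards avoids swapping: the 4-bit 7B VLM fits on one GPU while the ViT-Large depth model runs on the other.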

📈 Methodology Workflow

  1. Metric Calibration: Validating Depth Anything V2 against NYU ground-truth .mat files to ensure sub-10 cm accuracy.
  2. Object-to-Depth Fusion: Extracting LLaVA bounding boxes and mapping them to median metric depth (see the sketch after this list).
  3. Spatial Prompting: Injecting the resulting physical coordinates into the LLM context for "Embodied Reasoning."
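
A minimal sketch of steps 2 and 3, assuming a metric depth map (in metres) and an object label plus bounding box are already available. The helper names and the coarse left/ahead/right bearing rule are illustrative, not this repo's API:

```python
# Sketch only: median metric depth inside a bounding box, then a spatial prompt.
import numpy as np

def median_object_depth(depth_m: np.ndarray, box: tuple[int, int, int, int]) -> float:
    """Median metric depth (metres) inside a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return float(np.median(depth_m[y1:y2, x1:x2]))

def horizontal_bearing(box: tuple[int, int, int, int], image_width: int) -> str:
    """Coarse orientation from the box centre: left / ahead / right."""
    cx = (box[0] + box[2]) / 2
    if cx < image_width / 3:
        return "to your left"
    if cx > 2 * image_width / 3:
        return "to your right"
    return "ahead of you"

def spatial_prompt(label: str, depth_m: np.ndarray, box, image_width: int) -> str:
    """Inject physical measurements into the LLM context ("spatial prompting")."""
    distance = median_object_depth(depth_m, box)
    bearing = horizontal_bearing(box, image_width)
    return (
        f"Scene facts: a {label} is approximately {distance:.1f} m away, {bearing}. "
        "Using these measurements, give the user a short navigation instruction."
    )

# Example with synthetic values:
depth_map = np.full((480, 640), 2.4, dtype=np.float32)   # pretend everything is 2.4 m away
print(spatial_prompt("shopping cart", depth_map, (400, 120, 600, 400), image_width=640))
```

Feeding the resulting sentence into LLaVA's context is the "Spatial Prompting" step: the language model reasons over measured distances instead of guessing them from pixels.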

📂 Project Structure

  • src/: Core logic for dual-GPU inference and spatial fusion.
  • notebooks/: NYU Depth Dataset validation and metric benchmarking; the core check is sketched below.
  • models/: Checkpoints for Depth Anything V2 (Metric Hypersim).
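
A minimal sketch of the metric calibration check from workflow step 1, assuming the labelled nyu_depth_v2_labeled.mat file (a MATLAB v7.3 / HDF5 file, hence h5py) has been downloaded locally. The file path, dataset key, and the synthetic "prediction" are assumptions for illustration:

```python
# Sketch only: compare a predicted metric depth map against NYU ground truth.
import h5py
import numpy as np

def depth_error_report(pred_m: np.ndarray, gt_m: np.ndarray, tol_m: float = 0.10) -> dict:
    """Mean absolute error and the fraction of pixels within tol_m (e.g. 10 cm)."""
    valid = gt_m > 0                        # skip missing Kinect returns
    abs_err = np.abs(pred_m[valid] - gt_m[valid])
    return {
        "mae_m": float(abs_err.mean()),
        "within_tol": float((abs_err <= tol_m).mean()),
    }

with h5py.File("nyu_depth_v2_labeled.mat", "r") as f:    # assumed local path
    gt = np.transpose(f["depths"][0])       # sample 0; stored (W, H), transposed to (H, W); metres

# Stand-in prediction: in the real notebook this is Depth Anything V2's metric output.
pred = gt + np.random.normal(0.0, 0.05, size=gt.shape)

print(depth_error_report(pred, gt))
```

The "within_tol" fraction is the quantity the sub-10 cm target in the workflow refers to.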
