A state-of-the-art Assistive Spatial AI system that integrates Vision-Language Models (LLaVA v1.5) with Metric Depth Transformers (Depth Anything V2) to provide real-world spatial awareness.
Standard VLMs can describe "what" is in a scene but are "space-blind": they cannot accurately estimate distance. This project bridges that gap for assistive navigation (e.g., helping a user navigate a supermarket) by providing real-time metric distance and orientation of objects.
- Brain (Semantic): LLaVA v1.5 (7B) via 4-bit Quantization (GPU 0)
- Eyes (Geometric): Depth Anything V2 - ViT-Large (GPU 1)
- Dataset: NYU Depth V2 (Metric Validation)
- Hardware Strategy: Dual-GPU Model Parallelism (2x RTX Super 8GB), as sketched below
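A minimal loading sketch for this split, assuming the Hugging Face transformers + bitsandbytes stack; the checkpoint ids (especially the metric Depth Anything V2 variant) are placeholders, not the repo's pinned versions:

```python
# Sketch: dual-GPU model parallelism. Checkpoint ids are illustrative placeholders.
import torch
from transformers import (
    AutoProcessor, LlavaForConditionalGeneration,
    AutoImageProcessor, AutoModelForDepthEstimation,
    BitsAndBytesConfig,
)

# "Brain": LLaVA v1.5 (7B), 4-bit quantized, pinned to GPU 0.
llava_id = "llava-hf/llava-1.5-7b-hf"
llava_processor = AutoProcessor.from_pretrained(llava_id)
llava = LlavaForConditionalGeneration.from_pretrained(
    llava_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map={"": 0},  # keep the whole VLM on cuda:0
)

# "Eyes": Depth Anything V2 ViT-Large (metric checkpoint), pinned to GPU 1.
depth_id = "depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf"  # assumed id
depth_processor = AutoImageProcessor.from_pretrained(depth_id)
depth_model = (
    AutoModelForDepthEstimation.from_pretrained(depth_id, torch_dtype=torch.float16)
    .to("cuda:1")
    .eval()
)
```

Pinning each model to its own 8 GB card keeps the 4-bit LLaVA weights and the fp16 ViT-L depth model from contending for the same memory pool.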
- Metric Calibration: Validating Depth Anything V2 against NYU Depth V2 ground-truth .mat files to ensure sub-10 cm accuracy (see the validation sketch after this list).
- Object-to-Depth Fusion: Extracting LLaVA bounding boxes and mapping them to median metric depth values (see the fusion sketch after this list).
- Spatial Prompting: Injecting physical coordinates into the LLM context for "Embodied Reasoning."
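A hedged sketch of the NYU validation step, assuming the labeled nyu_depth_v2_labeled.mat file (MATLAB v7.3, readable with h5py); the helper names and the set of error metrics are illustrative, not the repo's exact benchmark code:

```python
# Sketch: validating predicted metric depth against NYU Depth V2 ground truth.
import h5py
import numpy as np

def nyu_frames(mat_path="nyu_depth_v2_labeled.mat"):
    """Yield (rgb, gt_depth_m) pairs from the labeled NYU .mat file."""
    with h5py.File(mat_path, "r") as f:
        images = f["images"]   # stored as (N, 3, W, H) in the v7.3 layout
        depths = f["depths"]   # stored as (N, W, H), in metres
        for i in range(images.shape[0]):
            rgb = np.transpose(images[i], (2, 1, 0))  # -> H x W x 3
            gt = np.transpose(depths[i], (1, 0))      # -> H x W
            yield rgb, gt

def depth_metrics(pred_m, gt_m):
    """Standard monocular-depth errors on valid ground-truth pixels."""
    mask = gt_m > 0
    pred, gt = pred_m[mask], gt_m[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)
    return {"abs_rel": abs_rel, "rmse_m": rmse, "delta1": delta1}
```

The RMSE in metres is the number to compare against the sub-10 cm calibration target above.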
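And a sketch of the fusion and spatial-prompting steps together: the bounding-box format, the left/ahead/right heuristic, and the function names are assumptions, not the repo's exact API.

```python
# Sketch: fuse a LLaVA-reported bounding box with the metric depth map,
# then inject the result back into the LLM context as text.
import numpy as np

def object_distance_m(depth_m: np.ndarray, box_xyxy) -> float:
    """Median metric depth (metres) inside a bounding box; the median is
    robust to depth bleeding across object edges."""
    x1, y1, x2, y2 = [int(v) for v in box_xyxy]
    crop = depth_m[y1:y2, x1:x2]
    return float(np.median(crop))

def spatial_prompt(label: str, box_xyxy, depth_m: np.ndarray, image_w: int) -> str:
    """Turn geometry into text the VLM can reason over ("Embodied Reasoning")."""
    dist = object_distance_m(depth_m, box_xyxy)
    cx = (box_xyxy[0] + box_xyxy[2]) / 2
    side = "left" if cx < image_w / 3 else "right" if cx > 2 * image_w / 3 else "ahead"
    return f"The {label} is about {dist:.1f} m away, {side} of you."

# Example: a detected "shopping cart" box fused with the depth map.
# print(spatial_prompt("shopping cart", (120, 200, 380, 470), depth_map, 640))
```

If LLaVA returns normalized 0-1 coordinates rather than pixel coordinates, scale the box to pixel units before cropping the depth map.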
- src/: Core logic for dual-GPU inference and spatial fusion.
- notebooks/: NYU Depth V2 dataset validation and metric benchmarking.
- models/: Checkpoints for Depth Anything V2 (Metric Hypersim).