diff --git a/CHANGELOG.md b/CHANGELOG.md
index 64b3fe2..3cc4041 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,5 @@
 # 更新日志
+[🇬🇧 English Version here](CHANGELOG_en.md)
 
 本文档记录项目的所有重要变更。
diff --git a/CHANGELOG_en.md b/CHANGELOG_en.md
new file mode 100644
index 0000000..03d6b79
--- /dev/null
+++ b/CHANGELOG_en.md
@@ -0,0 +1,91 @@
+# Changelog
+
+This document records all significant changes made to the project.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and the versioning follows [Semantic Versioning](https://semver.org/).
+
+## [Unreleased]
+
+### Added
+- First open-source release
+- Complete GitHub documentation (README, CONTRIBUTING, LICENSE, etc.)
+- Docker support
+- Environment variable configuration template
+
+### Changed
+- Improved README structure
+- Enhanced code comments
+
+---
+
+## [1.0.0] - 2025-01-XX
+
+### Added
+- 🚶 **Blind Path Navigation System**
+  - Real-time tactile paving detection and segmentation
+  - Intelligent voice guidance
+  - Obstacle detection and avoidance
+  - Sharp turn detection and alerts
+  - Optical flow stabilization
+
+- 🚦 **Crosswalk Assistance**
+  - Crosswalk recognition and direction detection
+  - Traffic light color recognition
+  - Alignment guidance system
+  - Safety reminders
+
+- 🔍 **Object Recognition and Search**
+  - YOLO-E open-vocabulary detection
+  - MediaPipe hand tracking and guidance
+  - Real-time object tracking
+  - Grasp action detection
+
+- 🎙️ **Real-Time Voice Interaction**
+  - Alibaba Paraformer ASR
+  - Qwen-Omni-Turbo multimodal dialogue
+  - Intelligent command parsing
+  - Context awareness
+
+- 📹 **Video and Audio Processing**
+  - Real-time WebSocket streaming
+  - Audio-video synchronized recording
+  - IMU data fusion
+  - Multi-channel audio mixing
+
+- 🎨 **Visualization and Interaction**
+  - Real-time web monitoring interface
+  - IMU 3D visualization
+  - Status dashboard
+  - Chinese-language interface
+
+### Tech Stack
+- FastAPI + WebSocket
+- YOLO11 / YOLO-E
+- MediaPipe
+- PyTorch + CUDA
+- OpenCV
+- DashScope API
+
+### Known Issues
+- [ ] Possible lag on low-end GPUs
+- [ ] No GPU acceleration support on macOS
+- [ ] Some Chinese fonts render incorrectly on Linux
+
+---
+
+## Versioning Guidelines
+
+### Major
+- Incompatible API changes
+
+### Minor
+- Backward-compatible new features
+
+### Patch
+- Backward-compatible bug fixes
+
+---
+
+[Unreleased]: https://github.com/yourusername/aiglass/compare/v1.0.0...HEAD
+[1.0.0]: https://github.com/yourusername/aiglass/releases/tag/v1.0.0
diff --git a/PROJECT_STRUCTURE.md b/PROJECT_STRUCTURE.md
index 219dfed..f35057f 100644
--- a/PROJECT_STRUCTURE.md
+++ b/PROJECT_STRUCTURE.md
@@ -1,4 +1,5 @@
 # 项目结构说明
+[🇬🇧 English Version here](PROJECT_STRUCTURE_en.md)
 
 本文档详细说明项目的目录结构和主要文件的作用。
diff --git a/PROJECT_STRUCTURE_en.md b/PROJECT_STRUCTURE_en.md
new file mode 100644
index 0000000..9f11a8c
--- /dev/null
+++ b/PROJECT_STRUCTURE_en.md
@@ -0,0 +1,398 @@
+# Project Structure Guide
+
+This document explains the project's directory layout and the purpose of the main files.
+
+## 📁 Directory Structure
+
+```
+rebuild1002/
+├── 📁 Main Application Files
+│   ├── app_main.py                   # App entry point (FastAPI service)
+│   ├── navigation_master.py          # Navigation controller (state machine)
+│   ├── workflow_blindpath.py         # Blind-path navigation workflow
+│   ├── workflow_crossstreet.py       # Crosswalk navigation workflow
+│   └── yolomedia.py                  # Item search workflow
+│
+├── 🎙️ Speech & Audio
+│   ├── asr_core.py                   # Speech recognition core
+│   ├── omni_client.py                # Qwen-Omni client
+│   ├── qwen_extractor.py             # Tag extraction (Chinese → English)
+│   ├── audio_player.py               # Audio player
+│   └── audio_stream.py               # Audio stream manager
+│
+├── 🤖 Models
+│   ├── yoloe_backend.py              # YOLO-E backend (open vocabulary)
+│   ├── trafficlight_detection.py     # Traffic light detection
+│   ├── obstacle_detector_client.py   # Obstacle detector client
+│   └── models.py                     # Model definitions
+│
+├── 🎥 Video Processing
+│   ├── bridge_io.py                  # Thread-safe frame buffers
+│   ├── sync_recorder.py              # A/V synchronized recording
+│   └── video_recorder.py             # Legacy video recorder
+│
+├── 🌐 Web Frontend
+│   ├── templates/
+│   │   └── index.html                # Main UI HTML
+│   └── static/
+│       ├── main.js                   # Main JS
+│       ├── vision.js                 # Vision stream handling
+│       ├── visualizer.js             # Data visualization
+│       ├── vision_renderer.js        # Rendering
+│       ├── vision.css                # Styles
+│       └── models/                   # 3D models (IMU visualization)
+│
+├── 🎵 Audio Assets
+│   ├── music/                        # System chimes
+│   │   ├── converted_ๅไธ.wav
+│   │   ├── converted_ๅไธ.wav
+│   │   └── ...
+│   └── voice/                        # Pre-recorded voice lines
+│       ├── voice_mapping.json
+│       └── *.wav
+│
+├── 🧠 Model Files
+│   └── model/
+│       ├── yolo-seg.pt               # Blind-path segmentation model
+│       ├── yoloe-11l-seg.pt          # YOLO-E open-vocabulary model
+│       ├── shoppingbest5.pt          # Item recognition model
+│       ├── trafficlight.pt           # Traffic light model
+│       └── hand_landmarker.task      # MediaPipe hand model
+│
+├── 📹 Recordings
+│   └── recordings/                   # Auto-saved video & audio
+│       ├── video_*.avi
+│       └── audio_*.wav
+│
+├── 🛠️ ESP32 Firmware
+│   └── compile/
+│       ├── compile.ino               # Arduino main program
+│       ├── camera_pins.h             # Camera pin definitions
+│       ├── ICM42688.cpp/h            # IMU driver
+│       └── ESP32_VIDEO_OPTIMIZATION.md
+│
+├── 🧪 Tests
+│   ├── test_recorder.py              # Recording tests
+│   ├── test_traffic_light.py         # Traffic light tests
+│   ├── test_cross_street_blindpath.py  # Navigation tests
+│   └── test_crosswalk_awareness.py   # Crosswalk awareness tests
+│
+├── 📚 Docs
+│   ├── README.md                     # Main project doc
+│   ├── INSTALLATION.md               # Install guide
+│   ├── CONTRIBUTING.md               # Contribution guide
+│   ├── FAQ.md                        # Frequently Asked Questions
+│   ├── CHANGELOG.md                  # Changelog
+│   ├── SECURITY.md                   # Security policy
+│   └── PROJECT_STRUCTURE.md          # This file
+│
+├── 🐳 Docker
+│   ├── Dockerfile                    # Docker image
+│   ├── docker-compose.yml            # Docker Compose config
+│   └── .dockerignore                 # Docker ignore list
+│
+├── ⚙️ Config
+│   ├── .env.example                  # Environment variable template
+│   ├── .gitignore                    # Git ignore list
+│   ├── requirements.txt              # Python deps
+│   ├── setup.sh                      # Linux/macOS setup script
+│   └── setup.bat                     # Windows setup script
+│
+├── 📜 License
+│   └── LICENSE                       # MIT License
+│
+└── 🔧 GitHub
+    └── .github/
+        ├── ISSUE_TEMPLATE/
+        │   ├── bug_report.md
+        │   └── feature_request.md
+        └── pull_request_template.md
+```
+
+## 📋 Key Files Overview
+
+### Main Application Layer
+
+#### `app_main.py`
+- **Purpose:** FastAPI main service handling all WebSocket connections
+- **Key Features:**
+  - WebSocket routing (`/ws/camera`, `/ws_audio`, `/ws/viewer`, etc.)
+  - Model loading & initialization
+  - State coordination & management
+  - Audio/video stream distribution
+- **Depends on:** All other modules
+- **Entry point:** `python app_main.py`
+
+#### `navigation_master.py`
+- **Purpose:** Central navigation controller; manages the system state machine
+- **Primary States:**
+  - IDLE – idle
+  - CHAT – dialogue mode
+  - BLINDPATH_NAV – tactile path navigation
+  - CROSSING – crosswalk
+  - TRAFFIC_LIGHT_DETECTION – traffic-light detection
+  - ITEM_SEARCH – item search
+- **Core Methods:**
+  - `process_frame()` – per-frame processing
+  - `start_blind_path_navigation()` – start tactile path navigation
+  - `start_crossing()` – start crosswalk mode
+  - `on_voice_command()` – handle voice commands
+
+### Workflow Modules
+
+#### `workflow_blindpath.py`
+- **Purpose:** Core logic for tactile path navigation
+- **Features:**
+  - Path segmentation & detection
+  - Obstacle detection
+  - Turn detection
+  - Optical-flow stabilization
+  - Directional guidance generation
+- **State Machine:**
+  - ONBOARDING – getting onto the path
+  - NAVIGATING – navigating along the path
+  - MANEUVERING_TURN – handling turns
+  - AVOIDING_OBSTACLE – obstacle avoidance
+
+#### `workflow_crossstreet.py`
+- **Purpose:** Crosswalk navigation logic
+- **Features:**
+  - Crosswalk detection
+  - Directional alignment
+  - Guidance generation
+- **Core Methods:**
+  - `_is_crosswalk_near()` – determine crosswalk proximity
+  - `_compute_angle_and_offset()` – compute angle and lateral offset
+
+#### `yolomedia.py`
+- **Purpose:** Item search workflow
+- **Features:**
+  - YOLO-E prompt-based detection
+  - MediaPipe hand tracking
+  - Optical-flow target tracking
+  - Hand guidance (direction prompts)
+  - Grasp-action detection
+- **Modes:**
+  - `SEGMENT`, `FLASH`, `CENTER_GUIDE`, `TRACK`
+
+### Speech / Voice Modules
+
+#### `asr_core.py`
+- **Purpose:** AliCloud Paraformer ASR (real-time speech recognition)
+- **Features:**
+  - Real-time transcription
+  - VAD (Voice Activity Detection)
+  - Result callbacks
+- **Key Class:** `ASRCallback`
+
+#### `omni_client.py`
+- **Purpose:** Qwen-Omni-Turbo multimodal dialogue client
+- **Features:**
+  - Streaming dialogue generation
+  - Image + text inputs
+  - Speech output
+- **Core Function:** `stream_chat()`
+
+#### `audio_player.py`
+- **Purpose:** Unified audio playback manager
+- **Features:**
+  - TTS playback
+  - Multi-channel audio mixing
+  - Volume control
+  - Thread-safe playback
+- **Core Functions:** `play_voice_text()`, `play_audio_threadsafe()`
+
+### Model Backends
+
+#### `yoloe_backend.py`
+- **Purpose:** YOLO-E open-vocabulary backend
+- **Features:**
+  - Prompt setup
+  - Real-time segmentation
+  - Target tracking
+- **Key Class:** `YoloEBackend`
+
+#### `trafficlight_detection.py`
+- **Purpose:** Traffic-light detection module
+- **Detection Methods:**
+  1. YOLO model detection
+  2. HSV color classification (fallback)
+- **Output:** Red / Green / Yellow / Unknown
+
+#### `obstacle_detector_client.py`
+- **Purpose:** Obstacle detection client
+- **Features:**
+  - Whitelist category filtering
+  - In-mask (path) checks
+  - Object attributes (area, position, risk)
+
+### Video Processing
+
+#### `bridge_io.py`
+- **Purpose:** Thread-safe frame buffering & distribution
+- **Features:**
+  - Producer–consumer pattern
+  - Raw frame buffer
+  - Processed frame fan-out
+- **Core Functions:**
+  - `push_raw_jpeg()` – receive ESP32 frames
+  - `wait_raw_bgr()` – get a raw frame
+  - `send_vis_bgr()` – send a processed frame
+
+#### `sync_recorder.py`
+- **Purpose:** Synchronized audio/video recording
+- **Features:**
+  - Records video and audio in sync
+  - Auto-timestamped filenames
+  - Thread safety
+- **Outputs:** `recordings/video_*.avi`, `audio_*.wav`
+
+### Frontend
+
+#### `templates/index.html`
+- **Purpose:** Web monitoring interface
+- **Main Areas:**
+  - Video stream display
+  - Status panel
+  - IMU 3D visualization
+  - Speech recognition results
+
+#### `static/main.js`
+- **Purpose:** Main JavaScript logic
+- **Features:**
+  - WebSocket connection management
+  - UI updates
+  - Event handling
+
+#### `static/vision.js`
+- **Purpose:** Vision stream handling
+- **Features:**
+  - Receive video frames via WebSocket
+  - Canvas rendering
+  - FPS calculation
+
+#### `static/visualizer.js`
+- **Purpose:** IMU 3D visualization (Three.js)
+- **Features:**
+  - Receive IMU data
+  - Real-time pose rendering
+  - Dynamic lighting effects
+
+## 🔄 Data Flow
+
+### Video Stream
+```
+ESP32-CAM
+↓ [JPEG] WebSocket /ws/camera
+↓ bridge_io.push_raw_jpeg()
+↓ yolomedia / navigation_master
+↓ bridge_io.send_vis_bgr()
+↓ [JPEG] WebSocket /ws/viewer
+↓ Browser Canvas
+```
+
+### Audio Stream (Upstream)
+```
+ESP32-MIC
+↓ [PCM16] WebSocket /ws_audio
+↓ asr_core
+↓ DashScope ASR
+↓ Recognition Result
+↓ start_ai_with_text_custom()
+```
+
+### Audio Stream (Downstream)
+```
+Qwen-Omni / TTS
+↓ audio_player
+↓ [PCM16] audio_stream
+↓ [WAV] HTTP /stream.wav
+↓ ESP32 Speaker
+```
+
+### IMU Data Stream
+```
+ESP32-IMU
+↓ [JSON] UDP 12345
+↓ process_imu_and_maybe_store()
+↓ [JSON] WebSocket /ws
+↓ visualizer.js (Three.js)
+```
+
+## 🎯 Key Design Patterns
+
+### 1. State Machine Pattern
+- **Location:** `navigation_master.py`
+- **Purpose:** Manage system state transitions
+- **States:** IDLE ↔ CHAT / BLINDPATH_NAV / CROSSING / ...
+
+### 2. Producer–Consumer Pattern
+- **Location:** `bridge_io.py`
+- **Purpose:** Decouple video reception and processing
+- **Implementation:** Threads + queues
+
+### 3. Strategy Pattern
+- **Location:** Each `workflow_*.py`
+- **Purpose:** Implement different navigation strategies
+- **Implementation:** Unified `process_frame()` interface
+
+### 4. Singleton Pattern
+- **Location:** Model loading
+- **Purpose:** Share model instances globally
+- **Implementation:** Global variables + initialization checks
+
+### 5. Observer Pattern
+- **Location:** WebSocket communication
+- **Purpose:** Allow multiple clients to subscribe to video streams
+- **Implementation:** `camera_viewers: Set[WebSocket]`
+
+## 📦 Dependencies
+```
+app_main.py
+├── navigation_master.py
+│   ├── workflow_blindpath.py
+│   │   ├── yoloe_backend.py
+│   │   └── obstacle_detector_client.py
+│   ├── workflow_crossstreet.py
+│   └── trafficlight_detection.py
+├── yolomedia.py
+│   └── yoloe_backend.py
+├── asr_core.py
+├── omni_client.py
+├── audio_player.py
+├── audio_stream.py
+├── bridge_io.py
+└── sync_recorder.py
+```
+
+## 🚀 Startup Process
+
+1. **Initialization Phase** (`app_main.py`)
+   - Load environment variables
+   - Load navigation models (YOLO, MediaPipe)
+   - Initialize the audio system
+   - Start the recording system
+   - Preload the traffic-light detection model
+
+2. **Service Launch** (FastAPI)
+   - Register WebSocket routes
+   - Mount static files
+   - Start the UDP listener (for IMU data)
+   - Start the HTTP service (port 8081)
+
+3. **Runtime Phase**
+   - Wait for ESP32 connections
+   - Receive video/audio/IMU data
+   - Process user voice commands
+   - Push real-time processing results
+
+4. **Shutdown Phase**
+   - Stop recording (save files)
+   - Close all WebSocket connections
+   - Release model resources
+   - Clean up temporary files
+
+---
+
+**Note:** For detailed implementation of each module, please refer to the corresponding source-file comments and docstrings.
diff --git a/README.md b/README.md
index bb9ddd9..d200ea1 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,5 @@
 # AI 智能盲人眼镜系统 🤖👓
+[🇬🇧 English Version here](README_en.md)
+
+
+
+
+
+## 📋 Table of Contents
+
+- [Features](#-features)
+- [System Requirements](#-system-requirements)
+- [Quick Start](#-quick-start)
+- [System Architecture](#-system-architecture)
+- [Usage Guide](#-user-guide)
+- [Configuration](#-configuration-guide)
+- [Developer Documentation](#-developer-documentation)
+- [FAQ](#-faq)
+- [Contribution Guidelines](#-contribution-guidelines)
+- [License](#-license)
+- [Acknowledgments](#-acknowledgments)
+
+---
+
+## ✨ Features
+
+### 🚶 Tactile Path Navigation
+- **Real-time path detection** – detects tactile paving paths with YOLO segmentation.
+- **Smart voice guidance** – provides precise directional prompts (turn left, turn right, go straight, etc.).
+- **Obstacle detection and avoidance** – identifies obstacles ahead and plans avoidance routes.
+- **Turn detection** – detects sharp turns and gives early voice warnings.
+- **Optical flow stabilization** – uses Lucas–Kanade optical flow to stabilize mask tracking.
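The point of the stabilization step is to keep the detected path center from jittering frame to frame. The actual implementation uses Lucas–Kanade optical flow (via OpenCV); as a rough stdlib-only illustration of the smoothing idea, with an invented class name and parameter, an exponential moving average over the mask centroid looks like this:

```python
# Illustrative sketch only: the project tracks mask points with Lucas-Kanade
# optical flow; this toy class shows just the smoothing idea, i.e. damping
# per-frame jitter in a detected path-center estimate.
class CentroidStabilizer:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha        # blend factor: lower = smoother but laggier
        self.state = None         # last smoothed (x, y) centroid

    def update(self, cx: float, cy: float) -> tuple[float, float]:
        if self.state is None:
            self.state = (cx, cy)  # first detection: accept as-is
        else:
            px, py = self.state
            # exponential moving average of the raw centroid
            self.state = (px + self.alpha * (cx - px),
                          py + self.alpha * (cy - py))
        return self.state

stab = CentroidStabilizer(alpha=0.5)
print(stab.update(100.0, 100.0))  # (100.0, 100.0)
print(stab.update(110.0, 100.0))  # (105.0, 100.0)
```

Lower `alpha` values smooth more aggressively at the cost of reacting more slowly to genuine path changes.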
+
+### 🚦 Crosswalk Assistance
+- **Crosswalk detection** – detects zebra crossings in real time.
+- **Traffic light detection** – identifies light states via color and shape.
+- **Alignment guidance** – helps the user align with the crosswalk center.
+- **Safety prompts** – announces when the light turns green so the user can cross safely.
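The project's structure guide describes `trafficlight_detection.py` as combining YOLO detection with an HSV color-classification fallback that outputs Red / Green / Yellow / Unknown. A minimal stdlib sketch of what such an HSV fallback might look like (the thresholds and function name here are illustrative guesses, not the module's real values):

```python
import colorsys

def classify_light_color(r: int, g: int, b: int) -> str:
    """Toy HSV fallback classifier; thresholds are illustrative only."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if s < 0.3 or v < 0.3:
        return "Unknown"          # too washed out or too dark to judge
    deg = h * 360.0               # hue in degrees
    if deg < 20 or deg > 340:
        return "Red"
    if 40 <= deg <= 75:
        return "Yellow"
    if 90 <= deg <= 160:
        return "Green"
    return "Unknown"

print(classify_light_color(220, 30, 30))  # Red
print(classify_light_color(30, 200, 60))  # Green
```

In practice the fallback would run on the mean color of a detected lamp region, not a single pixel, and the hue bands would be tuned on real footage.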
+
+### 🔍 Object Recognition & Search
+- **Voice-based search** – e.g., "find the Red Bull for me."
+- **Real-time tracking** – YOLO-E open-vocabulary detection + ByteTrack tracking.
+- **Hand guidance** – uses MediaPipe hand detection to guide hand position.
+- **Grasp detection** – detects the grasping motion that confirms the object has been picked up.
+- **Multi-modal feedback** – visual overlay + audio + centering cues.
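Grasp detection with MediaPipe typically reasons over the 21 normalized hand landmarks returned per hand (index 4 is the thumb tip, index 8 the index fingertip). A toy sketch of the idea, using a simple thumb–index "pinch" heuristic rather than the project's actual logic (the function name and threshold are invented):

```python
import math

# Illustrative only: MediaPipe hand landmarks are normalized (x, y) pairs;
# landmark 4 is the thumb tip and 8 the index fingertip. Treating a small
# thumb-index distance as a "grasp" is a simplification of the real logic.
THUMB_TIP, INDEX_TIP = 4, 8

def is_grasping(landmarks: list[tuple[float, float]],
                threshold: float = 0.05) -> bool:
    tx, ty = landmarks[THUMB_TIP]
    ix, iy = landmarks[INDEX_TIP]
    return math.hypot(tx - ix, ty - iy) < threshold

# 21 landmarks, all at the origin except the two tips we care about
lms = [(0.0, 0.0)] * 21
lms[THUMB_TIP] = (0.50, 0.50)
lms[INDEX_TIP] = (0.52, 0.51)
print(is_grasping(lms))  # True: tips are about 0.022 apart
```

A production version would also debounce over several frames so a momentary finger crossing is not mistaken for a pickup.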
+
+### 🎙️ Real-Time Voice Interaction
+- **Speech recognition (ASR)** – powered by Alibaba DashScope Paraformer.
+- **Multimodal conversation** – Qwen-Omni-Turbo supports image + text input with speech output.
+- **Smart command parsing** – distinguishes navigation, search, and chat commands.
+- **Context awareness** – ignores commands that are irrelevant to the current mode.
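The parsing step routes each transcript to navigation, item search, or open chat. The real system relies on the Qwen models for this; the stdlib-only keyword router below is just a sketch of the three-way split (the word lists and function name are invented for illustration):

```python
# Hypothetical sketch of "smart command parsing": a keyword router that
# splits transcripts into navigation / item_search / chat, mirroring the
# three command categories described above.
NAV_WORDS = ("navigate", "blind path", "cross", "crosswalk")
SEARCH_WORDS = ("find", "look for", "search")

def route_command(text: str) -> str:
    t = text.lower()
    if any(w in t for w in NAV_WORDS):
        return "navigation"
    if any(w in t for w in SEARCH_WORDS):
        return "item_search"
    return "chat"                 # everything else goes to open dialogue

print(route_command("Find the Red Bull for me"))  # item_search
print(route_command("How is the weather?"))       # chat
```

Context awareness then sits on top of a router like this: a result of `item_search` can simply be dropped while the system is mid-crossing, for example.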
+
+### 📹 Video & Audio Processing
+- **Real-time video streaming** – via WebSocket, with support for multiple simultaneous viewers.
+- **Audio/video recording** – saves synchronized, timestamped recordings.
+- **IMU fusion** – fuses ESP32 IMU data for pose estimation.
+- **Multi-channel audio mixing** – mixes system voice, AI responses, and ambient audio.
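Multi-viewer streaming means each processed frame is fanned out to every connected viewer socket (the project's structure guide mentions a `camera_viewers: Set[WebSocket]` set in `app_main.py` for exactly this). A stdlib-only sketch of that fan-out pattern, with asyncio queues standing in for WebSockets:

```python
import asyncio

# Sketch of the viewer fan-out: each connected viewer gets its own queue
# (in the real server, a WebSocket), and broadcast() pushes every processed
# frame to all of them. Names here are illustrative, not the project's.
viewers: set[asyncio.Queue] = set()

async def broadcast(frame: bytes) -> None:
    for q in list(viewers):        # copy: viewers may disconnect mid-loop
        await q.put(frame)

async def main() -> None:
    a, b = asyncio.Queue(), asyncio.Queue()
    viewers.update({a, b})
    await broadcast(b"jpeg-frame-1")
    print(await a.get(), await b.get())  # both viewers receive the frame

asyncio.run(main())
```

With real WebSockets the loop would also catch send failures and evict dead viewers from the set.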
+
+### 🎨 Visualization & Interaction
+- **Web dashboard** – view processed video in real time.
+- **3D IMU visualization** – real-time rendering via Three.js.
+- **Status panel** – shows the current navigation state, detections, FPS, and more.
+- **Chinese UI (customizable)** – all UI text and speech are currently in Chinese and can be localized.
+
+---
+
+## 💻 System Requirements
+
+### Hardware
+**Server / Development**
+- CPU: Intel i5 or above (i7/i9 recommended)
+- GPU: NVIDIA GPU with CUDA 11.8+ (RTX 3060 or higher recommended)
+- RAM: 8 GB (min) / 16 GB (recommended)
+- Storage: 10 GB free
+
+**Client (optional)**
+- ESP32-CAM or WebSocket camera
+- Microphone
+- Speaker/headphones
+
+### Software
+- OS: Windows 10/11, Ubuntu 20.04+, macOS 10.15+
+- Python 3.9–3.11
+- CUDA 11.8 or higher (for GPU)
+- Browser: Chrome 90+, Firefox 88+, Edge 90+
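Since the supported interpreter range is bounded on both sides, it can be worth checking it before installing dependencies. A tiny sketch (the helper name is ours, not part of the project):

```python
import sys

# Check the running interpreter against the supported range above
# (Python 3.9 - 3.11); adjust if the project later widens support.
def python_ok(version=sys.version_info) -> bool:
    major, minor = version[0], version[1]
    return (3, 9) <= (major, minor) <= (3, 11)

print(python_ok((3, 10)))  # True
print(python_ok((3, 12)))  # False
```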
+
+### API Keys
+- **Alibaba DashScope API Key** (required)
+ For speech recognition and Qwen-Omni dialogue.
+ โ