HandTalk presents a real-time, end-to-end sign language recognition system that leverages spatiotemporal feature engineering and bi-directional recurrent architectures to achieve low-latency gesture-to-text translation. By implementing a distributed inference pipeline with full-duplex WebSocket communication, the system maintains sub-200ms inference latency while processing 30 FPS video streams. This work addresses critical accessibility barriers for speech-impaired individuals through a scalable, production-ready architecture that balances computational efficiency with recognition accuracy across diverse signing contexts.
HandTalk implements a Sequence-to-Label classification paradigm optimized for temporal dynamics inherent in sign language communication. The architecture decouples data acquisition, feature preprocessing, and model inference across three computational layers:
MediaPipe Landmark Detection
The frontend leverages Google's MediaPipe framework to extract high-fidelity skeletal representations from RGB video streams:
- Hand Landmarks: 21 keypoints per hand (42 total) capturing joint positions in 3D space (x, y, z)
- Pose Landmarks: 33 keypoints representing upper body skeletal structure
- Sampling Rate: 30 FPS (33.3ms inter-frame interval)
- Feature Dimensionality: 63-99 dimensional feature vector per frame depending on pose inclusion
Rationale: Landmark-based representation reduces computational overhead compared to raw pixel processing while preserving the geometric relationships that carry gesture semantics. The coordinate normalization described below makes the representation invariant to translation and scale.
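For reference, the same landmark extraction can be reproduced offline with the `mediapipe` Python package (not part of HandTalk's listed stack; the production client performs this step in Flutter via ML Kit). The sketch below flattens one detected hand into the 63-dimensional per-frame vector used throughout this document; the helper name is illustrative.

```python
# Illustrative only: the production client extracts landmarks in Flutter via ML Kit;
# the `mediapipe` pip package is used here purely to reproduce the step offline.
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)

def extract_hand_features(bgr_frame):
    """Return a 63-dim vector (21 landmarks x (x, y, z)), or None if no hand is detected."""
    results = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    landmarks = results.multi_hand_landmarks[0].landmark
    return np.array([[lm.x, lm.y, lm.z] for lm in landmarks], dtype=np.float32).flatten()
```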
To ensure robust performance across varying camera positions, user distances, and hand sizes, the system implements multi-stage normalization:
Stage 1: Translation Invariance
import numpy as np  # landmarks: (21, 3) array of (x, y, z) per hand
centroid = landmarks[:, 0:2].mean(axis=0)      # compute the 2D centroid of (x, y)
normalized_coords = landmarks.copy()
normalized_coords[:, 0:2] -= centroid          # zero-center the (x, y) coordinates
Stage 2: Scale Invariance
max_distance = np.linalg.norm(normalized_coords[:, 0:2], axis=1).max()  # farthest landmark from the centroid (L2 norm)
scale_factor = 1.0 / max_distance
scaled_coords = normalized_coords * scale_factor                        # coordinates now lie within a unit radius
Stage 3: Temporal Buffering
Frames are aggregated into fixed-length sequences of 30 frames, forming a 3D tensor of shape [Batch=1, Time_Steps=30, Features=63]. This temporal window captures the complete motion trajectory of sign gestures, which typically span 0.8-1.2 seconds.
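A minimal sketch of this buffering step, assuming 63-dimensional per-frame vectors arrive from the landmark extractor; the `push_frame` helper and buffer names are illustrative rather than HandTalk's actual API:

```python
from collections import deque
import numpy as np

SEQ_LEN, N_FEATURES = 30, 63
frame_buffer = deque(maxlen=SEQ_LEN)            # sliding window: oldest frame is dropped automatically

def push_frame(feature_vector):
    """Append one 63-dim frame vector; return a (1, 30, 63) tensor once the window is full."""
    frame_buffer.append(np.asarray(feature_vector, dtype=np.float32))
    if len(frame_buffer) < SEQ_LEN:
        return None                              # not enough temporal context yet
    return np.stack(frame_buffer)[np.newaxis, ...]   # shape: (1, 30, 63), ready for inference
```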
Traditional HTTP-based architectures introduce unacceptable latency (>500ms) for real-time applications. HandTalk employs Socket.io for bidirectional communication:
- Client → Server: Landmark vectors transmitted as JSON payloads at 30 Hz
- Server → Client: Class predictions with confidence scores returned asynchronously
- Protocol Overhead: ~8-12ms per round-trip (excluding inference time)
Architectural Advantage: Persistent connections eliminate TCP handshake overhead, enabling the system to maintain continuous prediction streams without connection re-establishment penalties.
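A minimal server-side sketch of this exchange using Flask-SocketIO; the event names `landmarks` and `prediction`, the payload layout, and the `model` handle are illustrative assumptions, not HandTalk's exact protocol:

```python
from flask import Flask
from flask_socketio import SocketIO, emit
import numpy as np

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")
# `model` is assumed to be a Keras model loaded at startup, e.g. tf.keras.models.load_model(...)

@socketio.on("landmarks")                        # illustrative event name
def handle_landmarks(payload):
    sequence = np.asarray(payload["sequence"], dtype=np.float32)    # expected shape: (30, 63)
    probs = model.predict(sequence[np.newaxis, ...], verbose=0)[0]
    emit("prediction", {"label": int(probs.argmax()),               # illustrative event name
                        "confidence": float(probs.max())})

if __name__ == "__main__":
    socketio.run(app, host="0.0.0.0", port=5000)
```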
Sign language gestures are inherently non-causal — the semantic meaning often depends on the complete motion trajectory rather than sequential frame-by-frame analysis. A unidirectional LSTM processes sequences left-to-right, limiting its ability to leverage future context during early-frame classification.
Bi-Directional LSTM Solution:
- Forward Pass: Captures temporal dependencies from gesture onset to completion
- Backward Pass: Propagates information from gesture endpoint back to initial frames
- Concatenated Hidden States: The model fuses both temporal directions, enabling holistic trajectory understanding
Input Shape: (Batch, 30, 63) # 30 frames × 63 features
Layer 1: Bi-LSTM(128 units, return_sequences=True)
→ Forward Hidden States: (Batch, 30, 128)
→ Backward Hidden States: (Batch, 30, 128)
→ Concatenated Output: (Batch, 30, 256)
Layer 2: Dropout(0.3) # Regularization
Layer 3: Bi-LSTM(64 units, return_sequences=False)
→ Output: (Batch, 128) # Aggregated temporal representation
Layer 4: Dense(128, activation='relu')
Layer 5: Dropout(0.4)
Layer 6: Dense(num_classes, activation='softmax')
→ Output: (Batch, num_classes) # Probability distribution
Key Design Decisions:
- Return Sequences: Enabled in first layer to preserve temporal structure for second layer
- Gradient Clipping: Implemented to prevent exploding gradients during backpropagation through time
- Optimizer: Adam with learning rate decay (initial: 0.001, decay: 0.95 per epoch)
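The layer stack and training choices above can be reconstructed in Keras roughly as follows; values not stated in this document (e.g., `decay_steps`, `clipnorm=1.0`, the loss function) are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes, seq_len=30, n_features=63):
    model = models.Sequential([
        layers.Input(shape=(seq_len, n_features)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),   # (B, 30, 256)
        layers.Dropout(0.3),
        layers.Bidirectional(layers.LSTM(64)),                           # (B, 128)
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Adam with exponential decay and gradient clipping, as described above;
    # decay_steps should roughly equal the number of optimizer steps per epoch.
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.95)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule, clipnorm=1.0),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```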
- Framework: Flutter 3.x (Dart)
- Computer Vision: MediaPipe Hands & Pose (Google ML Kit)
- State Management: Provider pattern for reactive UI updates
- Networking: socket_io_client for WebSocket integration
- Web Framework: Flask 2.x with Flask-SocketIO
- Deep Learning: TensorFlow/Keras 2.x
- Preprocessing: NumPy, OpenCV
- Deployment: Docker-ready containerization (future: Kubernetes orchestration)
- Protocol: Socket.io v4.x (WebSocket transport with HTTP long-polling fallback)
- Serialization: JSON for landmark data, MessagePack for optimized binary transmission (future)
┌─────────────┐ WebSocket ┌──────────────┐
│ Flutter │ ←─────────────────→ │ Flask API │
│ Client │ (Socket.io) │ Gateway │
└─────────────┘ └──────┬───────┘
│
┌──────▼───────┐
│ Bi-LSTM │
│ Inference │
│ Engine │
└──────────────┘
| Metric | Value | Notes |
|---|---|---|
| Inference Latency | 45-80ms | Measured on Intel i7 CPU |
| End-to-End Latency | 150-200ms | Including network + preprocessing |
| Throughput | 30 predictions/sec | Matches camera frame rate |
| Model Accuracy | 92-95% | On validation set (lab conditions) |
| Model Parameters | ~850K | Compact for edge deployment |
HandTalk is organized into two primary sub-systems. For detailed technical documentation on each, please refer to their respective directories:
- Backend Inference Engine: Details on the Bi-LSTM model, preprocessing pipeline, and WebSocket handlers.
- Frontend Mobile Client: Details on the Flutter architecture, MediaPipe integration, and UI/UX design.
Current models exhibit signer-specific overfitting. Future work includes:
- Domain Adaptation: Implement adversarial training to learn signer-invariant features
- Few-Shot Learning: Meta-learning approaches (MAML, Prototypical Networks) for rapid adaptation to new signers with minimal examples
Landmark sequences form a skeletal graph where joints are nodes and bones are edges. ST-GCN architectures naturally model:
- Spatial Dependencies: Relationships between joints (e.g., finger synchronization)
- Temporal Dependencies: Motion patterns across frames
Expected Improvement: 5-8% accuracy gain over LSTM-based approaches, particularly for complex multi-hand gestures.
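As a starting point, the hand skeleton's adjacency matrix could be derived directly from MediaPipe's hand-connection topology; the sketch below (including the symmetric normalization) is illustrative and does not implement the full ST-GCN:

```python
import numpy as np
import mediapipe as mp

NUM_JOINTS = 21
adjacency = np.eye(NUM_JOINTS, dtype=np.float32)                 # self-loops for each joint
for start, end in mp.solutions.hands.HAND_CONNECTIONS:           # bone edges of the hand skeleton
    adjacency[start, end] = adjacency[end, start] = 1.0

# Symmetric normalization D^-1/2 A D^-1/2, as commonly used in graph convolutions.
deg_inv_sqrt = np.diag(1.0 / np.sqrt(adjacency.sum(axis=1)))
normalized_adjacency = deg_inv_sqrt @ adjacency @ deg_inv_sqrt
```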
Integrate Temporal Attention layers to identify critical frames within gesture sequences, enabling:
- Model Explainability: Visualization of which frames contribute most to classification
- Computational Efficiency: Dynamic frame selection to skip redundant frames
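One possible formulation is a simple additive attention pooling over the 30-frame window, sketched below; this is an illustration of the idea rather than a planned implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_attention_pool(sequence):
    """sequence: (batch, time, features) -> (context vector, per-frame attention weights)."""
    scores = layers.Dense(1)(sequence)                  # (B, T, 1): unnormalized frame relevance
    weights = tf.nn.softmax(scores, axis=1)             # normalize over the time axis
    pooled = tf.reduce_sum(weights * sequence, axis=1)  # (B, F): attention-weighted summary
    return pooled, weights                              # weights can be plotted for explainability
```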
Extend from isolated gesture classification to continuous stream processing:
- Segmentation Networks: Detect gesture boundaries in unsegmented video
- CTC Loss Function: Connectionist Temporal Classification for sequence alignment
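A minimal sketch of how CTC loss could be wired in with Keras utilities; tensor shapes follow `tf.keras.backend.ctc_batch_cost`, and the wrapper name is illustrative:

```python
import tensorflow as tf

def ctc_loss(y_true, y_pred, input_length, label_length):
    """y_pred: (batch, frames, num_glosses + 1) softmax outputs, last class = CTC blank.
    y_true: (batch, max_label_len) integer gloss IDs.
    input_length / label_length: (batch, 1) valid lengths per sample."""
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```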
Incorporate facial expressions and body pose for holistic sign language understanding:
- Late Fusion: Combine hand, face, and pose predictions at decision level
- Early Fusion: Concatenate multi-modal features before LSTM processing
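An early-fusion variant could look like the sketch below; the pose and face feature widths and the vocabulary size are hypothetical placeholders:

```python
from tensorflow.keras import layers, Input, Model

NUM_CLASSES = 50                   # placeholder vocabulary size
hand_in = Input(shape=(30, 63))    # hand landmarks (as in the current pipeline)
pose_in = Input(shape=(30, 99))    # hypothetical pose feature width
face_in = Input(shape=(30, 140))   # hypothetical face feature width

fused = layers.Concatenate(axis=-1)([hand_in, pose_in, face_in])   # early fusion: (B, 30, 302)
x = layers.Bidirectional(layers.LSTM(128))(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = Model([hand_in, pose_in, face_in], out)
```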
- Python 3.8+ with virtualenv
- Flutter SDK 3.0+
- CUDA-enabled GPU (optional, for training)
cd backend
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
python run.py # Server runs on http://localhost:5000
cd personnemuette # Flutter project directory
flutter pub get
flutter run # Launches on connected device/emulator
# Build backend container
docker build -t handtalk-backend ./backend
docker run -p 5000:5000 handtalk-backend
# Frontend requires native compilation
cd personnemuette && flutter build apk # Android
cd personnemuette && flutter build ios # iOS (requires macOS)
This project builds upon foundational research in sign language recognition:
- MediaPipe Framework: Lugaresi et al., "MediaPipe: A Framework for Building Perception Pipelines," arXiv:1906.08172
- Bi-LSTM Architectures: Schuster & Paliwal, "Bidirectional Recurrent Neural Networks," IEEE Trans. Signal Processing, 1997
- ST-GCN Baseline: Yan et al., "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition," AAAI 2018
- Sign Language Datasets: MS-ASL (Joze et al., 2018), WLASL (Li et al., 2020)
We welcome contributions from the research community! Areas of interest:
- Data Augmentation: Geometric transformations, temporal jittering
- Model Optimization: Quantization for mobile deployment (TFLite, ONNX)
- Multilingual Support: Extending beyond ASL to international sign languages
Please submit pull requests with clear documentation and unit tests.
This project is licensed under the MIT License - see LICENSE file for details.
- MediaPipe Team at Google Research for landmark detection frameworks
- TensorFlow Team for robust deep learning infrastructure
- Deaf Community for insights into sign language linguistics and accessibility needs
Developed as a research prototype for advanced human-computer interaction and accessibility.