This project implements a real-time hand digit recognition system using the Vision Transformer (ViT) model and OpenCV for webcam input. The model is trained to recognize hand gestures representing digits (0-9) from the Chinese Hand Gesture Number Recognition Dataset.
- Real-time hand gesture recognition using a webcam
- Digit/Number recognition using a Vision Transformer (ViT) model
- Confidence score for each prediction
- Displays predicted digit and confidence on the video feed
- Hand detection gate: a prediction is made only when a hand is present in the frame
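The confidence score listed above is commonly the softmax probability of the winning class. The following is a minimal, dependency-free sketch under that assumption; `softmax` and `top_prediction` are hypothetical helpers written for illustration, and the actual script may compute the same value with `torch.softmax` instead.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits):
    """Return (class_index, confidence) for the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


if __name__ == "__main__":
    # Ten made-up logits, one per class (illustrative values only).
    logits = [0.1, 0.3, 4.2, 0.0, -1.0, 0.5, 0.2, 0.1, 0.0, 0.4]
    index, confidence = top_prediction(logits)
    print(f"class index {index}, confidence {confidence:.2%}")
```

How a class index maps to a displayed digit depends on the label mapping stored with the trained model, which is not shown here.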
This project demonstrates the application of a Vision Transformer (ViT) model for real-time hand digit recognition. The model was trained on a custom dataset of approximately 5,000 images of hand gestures representing digits (1-10) and classifies them from webcam input. It reaches about 87% validation accuracy on this small dataset.
The goal is to develop an interactive, real-time system that recognizes hand signs and provides accurate predictions, making it suitable for applications such as virtual sign language recognition, gesture-based control systems, accessibility use cases, and educational tools.
Before running the project, ensure you have the following dependencies installed:
- Python 3.7+
- PyTorch 1.9+
- OpenCV 4.5+
- Pillow
- transformers
- numpy
- matplotlib
You can install the necessary libraries using pip:
pip install torch torchvision opencv-python pillow transformers numpy matplotlib
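After installing, it can help to confirm that every dependency is importable. This is a small sketch, not part of the project; `find_missing` is a hypothetical helper, and the module list simply mirrors the packages named above (note that `opencv-python` is imported as `cv2` and `pillow` as `PIL`).

```python
import importlib.util

# Import names, which differ from the pip package names for some libraries.
REQUIRED_MODULES = [
    "torch", "torchvision", "cv2", "PIL", "transformers", "numpy", "matplotlib",
]


def find_missing(modules=REQUIRED_MODULES):
    """Return the modules that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```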
Clone the project repository to your local machine:
git clone https://github.com/yourusername/hand-gesture-recognition.git
cd hand-gesture-recognition
Option 1: Use the pre-trained model from the checkpoint directory (you should already have the model trained). Place your trained model in the directory ./vit_results/checkpoint-1582/.
Option 2: Train the model on the dataset and save the checkpoint. The model should be a Vision Transformer (ViT) model fine-tuned for your specific hand gesture dataset.
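For either option, the checkpoint has to be loadable before the recognition script can run. The sketch below assumes the checkpoint was saved in the Hugging Face format (a `config.json` plus `model.safetensors` or `pytorch_model.bin`); `checkpoint_looks_complete` and `load_model` are hypothetical helpers, and the project's own loading code may differ.

```python
from pathlib import Path

CHECKPOINT_DIR = "./vit_results/checkpoint-1582/"  # path named in Option 1
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")  # assumed file names


def checkpoint_looks_complete(path):
    """True if the folder holds a config file and at least one weights file."""
    folder = Path(path)
    has_config = (folder / "config.json").is_file()
    has_weights = any((folder / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights


def load_model(path=CHECKPOINT_DIR):
    """Load the fine-tuned ViT classifier and switch it to inference mode."""
    # Imported lazily so the folder check works without transformers installed.
    from transformers import ViTForImageClassification

    if not checkpoint_looks_complete(path):
        raise FileNotFoundError(f"No usable checkpoint found in {path}")
    model = ViTForImageClassification.from_pretrained(path)
    model.eval()  # disable dropout for inference
    return model
```

Checking the folder first gives a clearer error than a failed load when the checkpoint has been placed in the wrong location.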
Ensure your dataset is organized as follows:
/Datasets
    /Chinese Hand Gestures Number Recognition Dataset
        /aug_imgs_split
            /train
                /01
                /02
                /03
                ...
            /val
                /01
                /02
                /03
                ...
            /test
                /01
                /02
                /03
                ...
The dataset should be divided into training, validation, and testing folders, each containing subfolders for each digit class (e.g., 01, 02, ..., 10).
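A quick way to confirm this layout before training is to compare the class folders across the three splits. This is a sketch only; `check_layout` is a hypothetical helper, and the root path passed to it should point at wherever `aug_imgs_split` lives on your machine.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def class_folders(split_dir):
    """Sorted names of the class subfolders inside one split folder."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


def check_layout(root):
    """Return a list of problems; an empty list means the layout is consistent."""
    root = Path(root)
    problems = []
    found = {}
    for split in SPLITS:
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split folder: {split}")
            continue
        found[split] = class_folders(split_dir)
        if not found[split]:
            problems.append(f"no class folders in: {split}")
    # Every split should contain the same class folders as train.
    reference = found.get("train")
    if reference is not None:
        for split, classes in found.items():
            if classes != reference:
                problems.append(f"{split} classes differ from train: {classes}")
    return problems


if __name__ == "__main__":
    root = "Datasets/Chinese Hand Gestures Number Recognition Dataset/aug_imgs_split"
    for problem in check_layout(root):
        print(problem)
```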
To run the real-time digit recognition system using your webcam after training with your dataset, execute the following command:
python vit_real_time_recognition.py
Once you run the script, a window will open displaying the webcam feed. If a hand is detected in the frame, the model will predict the digit and display it along with the confidence score. If no hand is detected, it will display "No Hand Detected."
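The behavior described above can be sketched as a capture loop that gates classification on hand detection. The source does not say how the hand detector or classifier is wired in, so `detect_hand` and `classify` are hypothetical callables supplied by the caller; the window title and the `q` quit key are also assumptions.

```python
def overlay_lines(hand_present, label=None, confidence=None):
    """Text to draw on a frame: a prediction only when a hand is present."""
    if not hand_present:
        return ["No Hand Detected"]
    return [f"Predicted Digit: {label}", f"Confidence: {confidence:.2%}"]


def run(detect_hand, classify, camera_index=0):
    """Show the webcam feed with overlays until 'q' is pressed.

    detect_hand(frame) -> bool and classify(frame) -> (label, confidence)
    are placeholders for the project's own detector and ViT model.
    """
    import cv2  # imported lazily so overlay_lines works without OpenCV

    capture = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # camera unavailable or stream ended
                break
            if detect_hand(frame):
                label, confidence = classify(frame)
                lines = overlay_lines(True, label, confidence)
            else:
                lines = overlay_lines(False)
            # Draw each line of text below the previous one.
            for row, text in enumerate(lines):
                cv2.putText(frame, text, (10, 30 + 30 * row),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
            cv2.imshow("Hand Digit Recognition", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        capture.release()
        cv2.destroyAllWindows()
```

Keeping the text formatting in a separate function makes the overlay logic easy to check without a camera attached.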
The predicted digit will be shown on the video feed with the format:
Predicted Digit: 01
Confidence: ~95.12%
- Predicted Digit: XX shows the predicted digit, formatted with leading zeros.
- Confidence: XX% shows the confidence level for the prediction.
- Validation Accuracy: 87%
- Test Accuracy: 86%
This project is licensed under the MIT License - see the LICENSE file for details.
- Vision Transformer (ViT): For digit recognition.
- OpenCV: For real-time image processing and webcam input.
- Chinese Hand Gesture Number Recognition Dataset: For training the model.