Scuba diving gesture recognition using Mediapipe, cv2 and PyTorch
I have always been a huge fan of Minority Report and awaited the day when we could use gestures in our day-to-day lives. Then came Pranav Mistry with his Sixth Sense technology, which blew my mind. However, it was too hardware-focused.
Google came out with Mediapipe in 2019. I had just completed my Open Water and Advanced Open Water scuba certifications when I came across some cool animations on Facebook built with Mediapipe. A quick search led me to Nicholas Renotte's famous Sign Language video. I was super impressed by the processing of webcam images using cv2 and implemented the approach for scuba diving signals using PyTorch.
Repo - Github repo
Documentation - https://google.github.io/mediapipe/
To train a model that captures a simple video feed from the webcam and categorizes the gestures shown by the user into one of five actions:
- Ok
- Stop
- Descend
- Not Ok
- Ascend
- Test the camera and the Mediapipe library (to ensure the lighting / setup is adequate, and to get the camera's fps for calculating the sequence length); a sketch of this check follows right after this list
- Capture data from the webcam for the various actions, i.e. for the 5 actions, gather 20 samples, each of which is a 1-second video (30 frames)
- Convert the 1500 numpy files of gestures, each containing 63 values (21 hand landmarks x 3 coordinates), into a 150 x 30 x 63 tensor
- One-hot encode the labels and save both files
- Import the model architecture and train the model on a single batch of 142 samples (after the train / test split)
- Test on the held-out test clips
- Test on the live data feed
- Process and store the renders
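
As a rough illustration of the first step, here is a minimal sketch of the camera / Mediapipe check. It assumes the standard cv2 and Mediapipe Hands APIs; the 90-frame sampling window and the window title are arbitrary choices for illustration, not values from the repo.

```python
import time

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
n_frames, start = 0, time.time()

with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while n_frames < 90:  # sample ~3 seconds of video to estimate fps
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB, cv2 captures BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
        cv2.imshow('Camera check', frame)
        n_frames += 1
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

fps = n_frames / (time.time() - start)
print(f'Approximate fps: {fps:.1f}')  # used to decide how many frames make up a 1-second sequence
cap.release()
cv2.destroyAllWindows()
```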
- Mediapipe landmarks can be significantly impacted by the lighting - both at the time of data collection and at the time of inference.
- If you're not careful while consolidating the various frames for your input dataset, the order of labels can get scattered. After completing your one-hot encoding, run a sample check on each class to confirm its index in the encoding.
- More samples! I could record only 150 samples across 5 different action classes.
- Stability over precision. Video processing has the annoying property of rapidly changing the predicted class as frames change. To avoid this, the model takes about 3-4 frames to stabilize its prediction, so you may see a bit of jitter in the displayed result before it settles on a class (a small sketch of this smoothing idea follows below). I had the same issue in my [object classification project](https://github.com/SwamiKannan/Formula1-car-detection).
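
A minimal sketch of that smoothing idea, assuming the model outputs per-class probabilities every frame; the `ACTIONS` list order, the 4-frame window and the `update_display` helper are illustrative names, not the repo's actual code.

```python
from collections import deque

import numpy as np

# Order must match the index each class received during one-hot encoding
ACTIONS = ['Ok', 'Stop', 'Descend', 'Not Ok', 'Ascend']
recent = deque(maxlen=4)  # roughly the 3-4 frames mentioned above
displayed = None

def update_display(probs):
    """Switch the displayed label only when the same class wins a few frames in a row."""
    global displayed
    recent.append(int(np.argmax(probs)))
    if len(recent) == recent.maxlen and len(set(recent)) == 1:
        displayed = ACTIONS[recent[0]]
    return displayed
```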
I. Data capture
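
A hedged sketch of what the capture loop could look like: Mediapipe Hands turns each frame into a 63-value landmark vector, and one `.npy` file is saved per frame. The folder layout (`data/<action>/<sample>/<frame>.npy`), the `SAMPLES_PER_ACTION` count and the `extract_keypoints` helper are assumptions for illustration, not necessarily the repo's exact structure.

```python
import os

import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands
ACTIONS = ['Ok', 'Stop', 'Descend', 'Not Ok', 'Ascend']
SEQ_LEN = 30              # frames per sample, i.e. roughly 1 second of video
SAMPLES_PER_ACTION = 30   # adjust to however many clips you want per class
DATA_DIR = 'data'         # hypothetical output folder

def extract_keypoints(results):
    """Flatten the 21 hand landmarks (x, y, z) into a 63-value vector; zeros if no hand is seen."""
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        return np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark]).flatten()
    return np.zeros(21 * 3)

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    for action in ACTIONS:
        for sample in range(SAMPLES_PER_ACTION):
            out_dir = os.path.join(DATA_DIR, action, str(sample))
            os.makedirs(out_dir, exist_ok=True)
            for frame_idx in range(SEQ_LEN):
                ok, frame = cap.read()
                if not ok:
                    continue
                results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                # One .npy file per frame, each holding the 63 landmark values
                np.save(os.path.join(out_dir, f'{frame_idx}.npy'), extract_keypoints(results))
                cv2.imshow('Collecting', frame)
                cv2.waitKey(1)
cap.release()
cv2.destroyAllWindows()
```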
II. Data processing
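
A minimal sketch of the consolidation step, under the same assumed folder layout as above: stack the per-frame `.npy` files into a (samples, 30, 63) tensor and one-hot encode the labels. The output file names `X.pt` and `y.pt` are placeholders.

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

ACTIONS = ['Ok', 'Stop', 'Descend', 'Not Ok', 'Ascend']
SEQ_LEN = 30
DATA_DIR = 'data'  # same hypothetical folder as in the capture sketch

sequences, labels = [], []
for class_idx, action in enumerate(ACTIONS):
    action_dir = os.path.join(DATA_DIR, action)
    for sample in sorted(os.listdir(action_dir)):
        # Re-assemble one sample from its 30 per-frame .npy files (63 values each)
        frames = [np.load(os.path.join(action_dir, sample, f'{i}.npy')) for i in range(SEQ_LEN)]
        sequences.append(frames)
        labels.append(class_idx)

X = torch.tensor(np.array(sequences), dtype=torch.float32)             # (samples, 30, 63)
y = F.one_hot(torch.tensor(labels), num_classes=len(ACTIONS)).float()  # (samples, 5)

# Sanity check: for a given sample, the 1 in its row of y should sit at that action's index
torch.save(X, 'X.pt')
torch.save(y, 'y.pt')
print(X.shape, y.shape)
```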
III. Model training
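
An illustrative training sketch, not the repo's exact architecture: it assumes a small LSTM classifier (`GestureLSTM`) trained on the whole training set as a single batch, with 8 samples held out so that roughly 142 of the 150 samples land in the training batch, as described above.

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """A small LSTM classifier over the (30, 63) landmark sequences (illustrative architecture)."""
    def __init__(self, n_features=63, hidden=64, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):
        out, _ = self.lstm(x)        # (batch, seq, hidden)
        return self.fc(out[:, -1])   # classify from the last time step

X, y = torch.load('X.pt'), torch.load('y.pt')

# Simple random split: 8 test samples leaves ~142 of 150 for the single training batch
perm = torch.randperm(X.shape[0])
n_test = 8
train_idx, test_idx = perm[n_test:], perm[:n_test]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

model = GestureLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    optimizer.zero_grad()
    logits = model(X_train)                          # the whole training set as one batch
    loss = criterion(logits, y_train.argmax(dim=1))  # recover class indices from one-hot labels
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    acc = (model(X_test).argmax(dim=1) == y_test.argmax(dim=1)).float().mean()
print(f'Test accuracy: {acc:.2f}')
torch.save(model.state_dict(), 'gesture_lstm.pt')  # placeholder filename for the live demo
```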
Gestures will take a second to align to the correct label
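
For context, here is a hypothetical live-demo loop tying the earlier sketches together; it reuses the `extract_keypoints`, `GestureLSTM` and `update_display` helpers defined above, which are illustrative names rather than the repo's actual code. Because a prediction is only made once a full 30-frame window is available and is then smoothed over a few frames, the displayed label lags the gesture by roughly a second.

```python
from collections import deque

import cv2
import mediapipe as mp
import numpy as np
import torch

# Assumes GestureLSTM, extract_keypoints and update_display from the earlier sketches
# are defined in the same script or importable.

mp_hands = mp.solutions.hands
window = deque(maxlen=30)      # rolling ~1-second window of landmark vectors

model = GestureLSTM()          # hypothetical class from the training sketch
model.load_state_dict(torch.load('gesture_lstm.pt'))
model.eval()

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        window.append(extract_keypoints(results))   # helper from the data-capture sketch
        label = None
        if len(window) == window.maxlen:
            seq = torch.tensor(np.array(list(window)), dtype=torch.float32).unsqueeze(0)  # (1, 30, 63)
            with torch.no_grad():
                probs = torch.softmax(model(seq), dim=1)[0].numpy()
            label = update_display(probs)            # smoothing helper from the notes above
        cv2.putText(frame, label or '...', (10, 40), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        cv2.imshow('Scuba gestures', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```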
Image credit for cover image: Rooster Teeth