Hey everyone,
I have given much thought to implementing this feature; here are my ideas.
Considering that we already have pose data (facial landmarks extracted using MediaPipe), we can use either a rule-based method or a DNN-based method.
Rule-based method:
A rule-based method is one where you define explicit rules and thresholds for each decision. For example: if the distance between the upper-eyelid and lower-eyelid landmarks is less than 0.1, the left eye is closed.
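To make this concrete, here is a minimal sketch of such a rule (the landmark indices and the 0.1 threshold are illustrative assumptions, not tuned values):

```python
# A minimal sketch of the rule-based check, assuming normalized MediaPipe
# Face Mesh coordinates; indices and threshold are assumptions, not final.
import numpy as np

UPPER_LID = 159  # assumed index of a left upper-eyelid landmark
LOWER_LID = 145  # assumed index of the matching lower-eyelid landmark
EYE_CLOSED_THRESHOLD = 0.1

def left_eye_closed(landmarks: np.ndarray) -> bool:
    """landmarks: (478, 2) array of normalized (x, y) coordinates."""
    dist = np.linalg.norm(landmarks[UPPER_LID] - landmarks[LOWER_LID])
    return dist < EYE_CLOSED_THRESHOLD
```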
The advantage of this method is that it needs only a tiny amount of data. However, it lacks robustness: what if the head is tilted upward and the distance between the eyelids falls below the threshold even though the eye is open?
Deep learning model:
A deep learning model trained on many samples would be able to do the task; it would generalize better and thus perform well on new data.
However, training a DNN model requires a lot of labeled data, for example, samples of non-blinking and blinking sequences, which we do not have. What we do have is a large dataset of unlabeled faces and a very small dataset of labeled actions.
So what I am proposing is to use unsupervised pretraining (on the large unlabeled data) followed by supervised fine-tuning (on the smaller labeled data).
The proposed method is a masked autoencoder, as described here.
The masked autoencoder will take a pose sequence as input with dimensions N × 64 (frames) × 478 (landmarks) × 2 (coordinates).
The first step is to embed the sequences, which will be done using a temporal dilated CNN (see paper).
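For illustration, here is a sketch of what that embedding step could look like (the embedding dimension, kernel size, and dilation rates are my assumptions; the paper's exact architecture may differ):

```python
# Sketch of a temporal dilated-CNN embedding: each landmark's (x, y)
# trajectory over 64 frames is treated as a 1D signal and embedded.
import torch
import torch.nn as nn

class TemporalDilatedEmbedding(nn.Module):
    def __init__(self, in_coords=2, embed_dim=64, dilations=(1, 2, 4)):
        super().__init__()
        layers, ch = [], in_coords
        for d in dilations:
            # padding=d keeps the temporal length unchanged for kernel size 3
            layers += [nn.Conv1d(ch, embed_dim, kernel_size=3,
                                 dilation=d, padding=d), nn.GELU()]
            ch = embed_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (N, 64 frames, 478 landmarks, 2 coords)
        N, T, L, C = x.shape
        x = x.permute(0, 2, 3, 1).reshape(N * L, C, T)  # one signal per landmark
        x = self.net(x)                                  # (N*L, embed_dim, T)
        return x.reshape(N, L, -1, T).permute(0, 3, 1, 2)  # (N, T, L, embed_dim)
```

One embedding vector per landmark per frame keeps the token layout compatible with the per-landmark masking described next.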
Next, each embedded landmark will get a class encoding (analogous to a positional encoding). Then, a portion of the landmark tokens will be masked out (omitted).
Next, the remaining embedded landmarks will be fed into a transformer encoder-decoder model, which will eventually reconstruct the original pose sequence.
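Here is a rough sketch of how the class encoding, masking, and encoder-decoder stages could fit together (the mask ratio, model sizes, and one-token-per-landmark-per-frame layout are all assumptions):

```python
# Sketch of an MAE-style pretraining model over landmark tokens.
import torch
import torch.nn as nn

class MaskedPoseAutoencoder(nn.Module):
    def __init__(self, embed_dim=64, num_landmarks=478, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # "class encoding": a learned embedding identifying each token's landmark
        self.landmark_embed = nn.Embedding(num_landmarks, embed_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(embed_dim, 2)  # reconstruct (x, y) per token

    def forward(self, tokens, landmark_ids):
        # tokens: (N, S, embed_dim) with S = frames * landmarks
        # landmark_ids: (N, S) landmark index of each token
        x = tokens + self.landmark_embed(landmark_ids)
        N, S, D = x.shape
        keep = int(S * (1 - self.mask_ratio))
        perm = torch.rand(N, S, device=x.device).argsort(dim=1)
        keep_idx = perm[:, :keep]  # indices of the visible (unmasked) tokens
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        enc = self.encoder(visible)
        # scatter encoded tokens back; masked slots get the shared mask token
        full = self.mask_token.expand(N, S, D).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), enc)
        return self.head(self.decoder(full))  # (N, S, 2) reconstruction
```

During pretraining, a reconstruction loss (e.g., MSE against the original coordinates, typically computed on the masked tokens only) would push the encoder to learn useful motion features.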
Finally, after the model is pretrained, we will collect labeled data from the crowd and use it to fine-tune the pretrained encoder on a classification task.
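As a sketch, fine-tuning could amount to attaching a small classification head to the pretrained encoder (the class count, pooling choice, and names here are hypothetical, reusing the modules sketched above):

```python
# Sketch of the fine-tuning stage: reuse the pretrained encoder, no masking.
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, pretrained_encoder, embed_dim=64, num_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder  # weights from MAE pretraining
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # tokens: (N, S, embed_dim) embedded landmark tokens, none masked
        feats = self.encoder(tokens)                # (N, S, embed_dim)
        return self.classifier(feats.mean(dim=1))   # mean-pool, then classify
```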