This project detects suspicious behavior of a person by analyzing that person's facial aspects. The implementation primarily targets a lab monitoring system.
The overview of the system is shown below.
As shown in the image above, the user (student) works on a computer running a background application, while the officials access a web application. The background application runs real-time detection algorithms that identify suspicious behavior of the user and raise instant warnings for the officials. These algorithms focus on certain important aspects of the face, along with some external aspects, to make detection possible. Detection details and warnings are sent from the computer to a remote server, which updates a database and renders the details to the web application. The officials are notified of warnings through this web application.
Note that there can be unexpected delays and changes to this plan.
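As a rough illustration of the client-to-server flow described above, the sketch below posts a detection event from the background application to the remote server over HTTP. The endpoint URL, route, and payload field names are placeholder assumptions, not the project's actual API.

```python
import json
import time
import urllib.request

# Hypothetical endpoint; the real server address and route are project-specific.
SERVER_URL = "http://example-lab-server.local/api/warnings"

def send_warning(student_id: str, event: str) -> None:
    """Post a suspicious-behavior event from the client app to the remote server."""
    payload = {
        "student_id": student_id,   # assumed field name
        "event": event,             # e.g. "multiple_faces", "no_face"
        "timestamp": time.time(),
    }
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # server stores the event and pushes it to the web app

# Example: the detector raised a multi-face event for one student.
# send_warning("student_42", "multiple_faces")
```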
Face detection is an essential feature of this monitoring system. While it could serve as an attendance-recording procedure at the initial stage of the monitoring process, the student's presence throughout the monitoring session is what matters most. The system must also ensure that the student works alone, which is where the detection of multiple faces comes into play.
Each frame of the webcam input video is processed for faces using the MediaPipe Face Detection model. Face detection serves two checks: detecting that no face is present, and detecting multiple faces in the input frame. Given an input frame, the model first yields a binary outcome: whether at least one face is detected. If so, the system then checks whether one face or multiple faces were found. As per the needs of the system, both the detection of multiple faces and the detection of no face are treated as suspicious and are notified to the officials.
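A minimal sketch of this check, assuming the Python MediaPipe Face Detection solution and a 0.5 confidence threshold (a reasonable default, not confirmed by the source):

```python
import cv2
import mediapipe as mp

# One detector instance reused across frames (created once, not per frame).
_detector = mp.solutions.face_detection.FaceDetection(min_detection_confidence=0.5)

def classify_frame(frame_bgr) -> str:
    """Return 'no_face', 'single_face', or 'multiple_faces' for a BGR webcam frame."""
    # MediaPipe expects RGB input, while OpenCV captures BGR.
    results = _detector.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    count = len(results.detections) if results.detections else 0
    if count == 0:
        return "no_face"          # suspicious: student absent
    if count == 1:
        return "single_face"      # expected state
    return "multiple_faces"       # suspicious: another person in frame

# Example usage:
# capture = cv2.VideoCapture(0)
# ok, frame = capture.read()
# if ok:
#     print(classify_frame(frame))
```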
Two alternative face detection strategies were also evaluated:

- Haar Cascade Classifier (HCC)
- Multi-task Cascaded Convolutional Networks (MTCNN)
Detection Strategy | Limitations | Positives | Improvements
---|---|---|---
Haar Cascade Classifier | Detects non-faces as faces in some instances; no detection of faces in low lighting | Simple and lightweight | Asynchronous programming; multi-threading
Multi-Task CNN | Inability to limit the distance of detection | | Asynchronous programming; multi-threading
MediaPipe Face Detection Model | | Lightweight object detection; effective GPU utilization; quality prediction; allows estimation of face rotation (roll angle) | |
This system computes a good estimate of the orientation angle of a student's face using 3D coordinate geometry. A 3D coordinate frame is built from the 3D facial landmark coordinates obtained with the MediaPipe Pose Estimation model. The nose landmark N and the left and right ear landmarks L and R are extracted as 3D points, i.e. (x, y, z) coordinate tuples, from the model's results for each webcam input frame. The estimation then proceeds as follows (a code sketch follows the list):
- Get the nose point coordinates (N)
- Get the left and right ear point coordinates (L & R)
- Get the 3D line vector LR
- Get the 3D midpoint M of the segment LR
- Get the 3D line vector NM
- Get the 3D plane P perpendicular to the camera axis
- Find the angle between the line NM and the plane P
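A minimal sketch of these steps, under three assumptions not stated in the source: M is the midpoint of LR, the plane P perpendicular to the camera has the z-axis as its normal, and MediaPipe Pose's NOSE, LEFT_EAR, and RIGHT_EAR landmarks are used. The angle between a line and a plane is arcsin of the normalized dot product of the line vector with the plane normal.

```python
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose

def head_angle_degrees(landmarks) -> float:
    """Angle between the nose-to-ear-midpoint line NM and a plane P
    perpendicular to the camera axis (normal assumed to be the z-axis)."""
    def point(idx):
        lm = landmarks[idx]
        return np.array([lm.x, lm.y, lm.z])

    N = point(mp_pose.PoseLandmark.NOSE)
    L = point(mp_pose.PoseLandmark.LEFT_EAR)
    R = point(mp_pose.PoseLandmark.RIGHT_EAR)

    M = (L + R) / 2.0               # assumed: midpoint of the 3D segment LR
    NM = M - N                      # line vector from nose to M
    n = np.array([0.0, 0.0, 1.0])   # assumed plane normal: the camera axis

    # Angle between a line and a plane: arcsin(|NM . n| / (|NM| |n|)).
    sin_angle = abs(NM.dot(n)) / (np.linalg.norm(NM) * np.linalg.norm(n))
    return float(np.degrees(np.arcsin(np.clip(sin_angle, 0.0, 1.0))))

# Example usage:
# with mp_pose.Pose() as pose:
#     results = pose.process(rgb_frame)   # rgb_frame: an RGB webcam frame
#     if results.pose_landmarks:
#         angle = head_angle_degrees(results.pose_landmarks.landmark)
```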
A speaking detection model, pre-trained using the HOG-based dlib face detector, is utilized to predict whether lip movement is observed. Each prediction window is processed as follows (a code sketch follows the list):
- Collect a sequence of 25 frames.
- For each video frame in this sequence:
  - Detect the face in the frame using a face detector (MediaPipe Face Mesh model).
  - From the landmark predictor, fetch the points that mark the inner edges of the top and bottom lip.
  - Calculate the average pixel separation between each point pair and store this distance in the lip separation sequence.
- Once all 25 frames are processed this way, perform min-max scaling over the 25-length sequence.
- Feed this normalized lip separation sequence into the RNN.
- The RNN outputs a 2-element tuple (speech, silence) representing the likelihood that the speaker was speaking or silent during the preceding 25 video frames.
- Repeat the process for the next 25-frame window of the input video.
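A sketch of the windowing and normalization stages follows. The inner-lip landmark index pairs and the `rnn_model` name are assumptions for illustration, since the source does not list them.

```python
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

# Assumed inner-lip landmark index pairs (top, bottom) on the MediaPipe Face Mesh;
# 13/14 is the inner lip center, the other pairs flank it.
LIP_PAIRS = [(13, 14), (82, 87), (312, 317)]
WINDOW = 25  # frames per prediction window

def lip_separation(landmarks, frame_height: int) -> float:
    """Average pixel distance between the assumed inner-lip point pairs."""
    gaps = [abs(landmarks[top].y - landmarks[bottom].y) * frame_height
            for top, bottom in LIP_PAIRS]
    return float(np.mean(gaps))

def normalized_sequence(frames) -> np.ndarray:
    """Build and min-max scale the 25-length lip separation sequence."""
    seq = []
    with mp_face_mesh.FaceMesh(max_num_faces=1) as mesh:
        for frame in frames:  # frames: 25 RGB frames
            results = mesh.process(frame)
            if results.multi_face_landmarks:
                lm = results.multi_face_landmarks[0].landmark
                seq.append(lip_separation(lm, frame.shape[0]))
            else:
                seq.append(0.0)  # no face found: treat separation as zero
    seq = np.array(seq)
    span = seq.max() - seq.min()
    return (seq - seq.min()) / span if span > 0 else np.zeros_like(seq)

# The normalized window would then be fed to the pre-trained RNN, e.g.:
# probs = rnn_model.predict(normalized_sequence(frames).reshape(1, WINDOW, 1))
# probs[0] ~ (p_speech, p_silence) for the 25-frame window.
```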