-
- OpenCV
- Yolov8Pose
- My own dataset
- MLP
- Cellphone
I had this idea after watching a video (I couldn't find it again to link here) in which a prison guard passing between cells was attacked by an inmate. She was saved by the other inmates, who ran to her aid. In the video, I noticed that a considerable amount of time passed between the start of the attack and the guards arriving at the scene, by which point the attack had already been stopped and the aggressive inmate restrained by the other inmates. So I thought: if there were a system that could read the security camera feed, classify the behavior it sees, and emit an alert to the guards when that behavior is abnormal (an inmate attack, for example), this kind of situation could be avoided. But obtaining a dataset from prison security cameras would involve complicated, time-consuming bureaucracy, so I redirected the project toward residential security systems. Even with that focus, I could not find a dataset on behavior, so I had to create my own, which can be found on my Kaggle profile at this link.
-
I started the project by creating the training and testing dataset. Since I didn't have the funds to acquire security cameras, I positioned my cellphone on the outer wall of my house.
I recorded 152 videos of normal behaviors and 152 of abnormal behaviors. I considered the following behaviors abnormal:
- Turn towards the residence and stare at it.
- Grab the gate.
- Try to climb the gate.
- Garage gate:
  - Mess with the lock.
  - Try to lift it.
  - Stand in front of it.
I split the raw recordings into clips of around 10 seconds each, which is how I arrived at 152 per class. For normal behaviors, I simply walked from one point to another on the street, entering and leaving the frame. For abnormal behaviors, I performed each one, timing it to roughly 10 seconds.
- After creating the dataset of videos separated into Normal and Abnormal folders, I decided to create a numerical dataset, which is the job of the make_dataset.py code. This code reads a video frame by frame, passes each frame through YOLOv8-Pose, which identifies the keypoints of the people in the frame, and saves each frame's result to a JSON file inside a folder. For example, reading the abnormal_1 video will create the directory ./Abnormal/abnormal_1/frame_x.json, where x is the frame number.
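To make this step concrete, here is a minimal sketch of what the extraction could look like. This is not the actual make_dataset.py: the model file name, the video folder layout, and the JSON structure are my assumptions based on the description above.

```python
# Sketch of the frame-to-JSON extraction step (not the actual make_dataset.py).
# Assumes the ultralytics package and OpenCV are installed and that videos
# live under a path like videos/Abnormal/abnormal_1.mp4 (assumed layout).
import json
from pathlib import Path

import cv2
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")  # assumed pose model variant

def extract_keypoints(video_path: str, out_dir: str) -> None:
    """Run YOLOv8-Pose on every frame and save the keypoints as one JSON per frame."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, verbose=False)[0]
        # keypoints.xy is a (num_people, 17, 2) tensor of pixel coordinates
        people = result.keypoints.xy.tolist() if result.keypoints is not None else []
        with open(Path(out_dir) / f"frame_{frame_idx}.json", "w") as f:
            json.dump({"frame": frame_idx, "people": people}, f)
        frame_idx += 1
    cap.release()

# Example: extract_keypoints("videos/Abnormal/abnormal_1.mp4", "Abnormal/abnormal_1")
```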
-
The machine learning model is an MLP with 3 hidden layers and 1 output layer. Its architecture is tapered: the first hidden layer has 512 neurons, the second 256, and the third 128. The output layer has a single binary-result neuron, where 1 indicates Abnormal and 0 indicates Normal. There are dropout and batch normalization layers between the hidden layers; the training code is ml_training.py.
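The framework isn't specified here, so the following is only a sketch of that architecture, assuming PyTorch. The layer sizes and the binary output follow the description above; the dropout probability and the sigmoid activation on the output are assumptions.

```python
# Sketch of the described MLP (3 tapered hidden layers + 1 binary output neuron),
# assuming PyTorch; the dropout probability of 0.3 is an assumption.
import torch
import torch.nn as nn

class BehaviorMLP(nn.Module):
    def __init__(self, input_size: int = 340, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 1),  # 1 = Abnormal, 0 = Normal
        )

    def forward(self, x):
        # Output a probability; threshold at 0.5 to get the binary label.
        return torch.sigmoid(self.net(x))
```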
The input is a vector of 340 values. You might be wondering: if YOLOv8-Pose returns 17 keypoints per person, how is the input 340?
Well, each keypoint is an XY coordinate, which gives 34 values per frame, but that still doesn't add up to 340. The input size of 340 was a choice so that the MLP could understand a movement with all its fluidity. If I passed only one frame at a time, it would learn as if it were looking at photos. To correct this, the code takes the first 10 frames from the moment the person enters the video; these 10 frames × 17 keypoints × 2 coordinates produce the 340-value input. After the first 10 frames, the sliding window moves one frame forward, meaning if it took frames 0 to 9 on the first iteration, it takes 1 to 10 on the second. I chose a window size of 10 because 10 frames are a few seconds of video, which gives me a good reading of the movement without a noticeable delay.
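As a rough illustration of how those 340-value windows could be assembled from the per-frame keypoints, here is a sketch. The single-person assumption and the flattening order (frame by frame, x then y per keypoint) are mine, not necessarily what the project code does.

```python
# Sketch of the size-10 sliding window over per-frame keypoints.
# Assumes one person per frame, with each frame given as a list of 17 (x, y) pairs.
from typing import List

def sliding_windows(frames: List[List[List[float]]], window: int = 10) -> List[List[float]]:
    """Flatten every run of `window` consecutive frames into one 340-value vector."""
    vectors = []
    for start in range(len(frames) - window + 1):
        vector = []
        for frame in frames[start:start + window]:
            for x, y in frame:           # 17 keypoints per frame
                vector.extend([x, y])    # 2 values per keypoint
        vectors.append(vector)           # 10 * 17 * 2 = 340 values
    return vectors
```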