This repository contains some preliminary experiments focused on face analysis from video, covering the following key objectives:
- Person tracking & identification: Track a person in a video and potentially identify them. This codebase brings us close to achieving that goal, which I know would make Yibei very happy. One suggested improvement (I didn't have time to implement it, sorry): currently, we cluster faces based solely on their embeddings, without considering their location within the frame. To improve accuracy, we could incorporate the face's position in the frame, for example by using a sliding window over time combined with a majority-voting algorithm to assign labels (see the sketch right after this list).
- Active speaker detection: Once faces are tracked in the video, you can apply speaker diarization (commonly referred to as active speaker detection in the video space). The model available here performs quite well for this task and combines audio and video.
- Facial descriptor extraction: Extract a vector of descriptors from each face, including sex, age, emotion, a speaking flag (yes/no), and Facial Action Coding System (FACS) action units. When combined with the speech vector that we have already discussed within senselab, this may enable some cool multimodal analyses (a sketch of what such a descriptor record could look like appears further below).
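As a very rough sketch of the majority-voting idea from the first bullet (the function name, window size, and label format are placeholders, not part of the current codebase), the temporal smoothing step could look something like this:

```python
from collections import Counter

def smooth_labels(frame_labels, window_size=15):
    """Assign each frame the majority label within a centered temporal window.

    frame_labels: per-frame cluster/identity labels for one face track,
    e.g. ["id_0", "id_0", "id_1", "id_0", ...] (placeholder format).
    """
    half = window_size // 2
    smoothed = []
    for i in range(len(frame_labels)):
        window = frame_labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed

# Example: a brief spurious switch to "id_1" is voted back to "id_0".
print(smooth_labels(["id_0"] * 5 + ["id_1"] * 2 + ["id_0"] * 5, window_size=5))
```

Note that this only covers the voting step; combining it with the face's position in the frame (e.g., penalizing label switches between spatially distant detections) would be a separate addition.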
Once we have more robust expertise with these models, we should move our functions and pipelines to senselab (@jordan, feel free to start).
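To make the descriptor bullet above a bit more concrete, here is one possible (purely illustrative) shape for a per-face, per-frame descriptor record; the class and field names are assumptions, not the current code:

```python
from dataclasses import dataclass, field

@dataclass
class FaceDescriptor:
    """Hypothetical per-face, per-frame descriptor record (field names illustrative)."""
    frame_index: int
    face_id: str               # identity label from tracking/clustering
    sex: str                   # model-predicted sex
    age: float                 # estimated age in years
    emotion: str               # e.g. "happy", "neutral", ...
    is_speaking: bool          # flag from active speaker detection
    action_units: dict = field(default_factory=dict)  # FACS AU -> intensity
```

A flat record like this would also be easy to align frame-by-frame with the speech vector mentioned above.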
- Install the necessary packages: `pip install -r requirements.txt`
- Edit `video_path` so that it points to the video file you want to analyze. I would suggest starting with a very short video that includes some edge cases (e.g., a Zoom call with someone sharing their screen at some point); see the hypothetical example after this list.
- To run the experiments: `cd src`, then `python main.py`
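As a purely hypothetical illustration of the `video_path` edit above (the exact variable location may differ in this codebase), the change might look something like this near the top of `src/main.py`:

```python
from pathlib import Path

# Hypothetical example: point video_path at the file you want to analyze.
# A short clip with edge cases (e.g., a Zoom screen share) is a good first test.
video_path = "/path/to/your/short_test_video.mp4"
assert Path(video_path).exists(), f"Video not found: {video_path}"
```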