Human action recognition (HAR) is an important task in computer vision, with applications such as video surveillance, video content analysis, and video security control. However, it is a challenging task due to background clutter, lighting variation, and the fact that human actions vary over time, are seen from different viewpoints, and may be occluded by other objects in the environment. In this Capstone Project, we focus on action recognition with multiple modalities. We examine and evaluate three different approaches on two main datasets.
- Python 3.6 or higher
- PyTorch and torchvision
- OpenCV with GPU support
- timm==0.4.8/0.4.12
- TensorboardX
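Before training, it can save time to verify that the listed dependencies are actually importable. The sketch below is not part of the repo; it is a minimal, hypothetical check using only the standard library, assuming the import names `torch`, `torchvision`, `cv2`, `timm`, and `tensorboardX` for the packages above.

```python
import importlib.util

# Import names assumed for the requirements listed above.
REQUIRED = ["torch", "torchvision", "cv2", "timm", "tensorboardX"]

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Example: print any dependencies that still need to be installed.
# print(missing_packages(REQUIRED))
```

Running `missing_packages(REQUIRED)` returns an empty list when the environment is fully set up.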
We use two datasets, HMDB51 and UCF101, restricted to the first 10 classes of each. The full datasets and splits can be downloaded from:
-
For Temporal Segment Network and Motion-Augmented RGB Stream, download the dataset here, together with the annotation file.
Three models are implemented and evaluated on the above datasets: Temporal Segment Network (TSN), Motion-Augmented RGB Stream (MARS), and Video Masked Autoencoders (VideoMAE). Detailed implementations, training instructions, and our evaluation results are included in each corresponding folder.
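To illustrate the core idea behind TSN, the sketch below shows its sparse temporal sampling scheme: the video is divided into equal segments and one frame index is drawn from each (randomly during training, the segment center at test time). This is a minimal illustration of the published method, not the code in this repo.

```python
import random

def sample_segment_indices(num_frames, num_segments=3, train=True):
    """TSN-style sparse sampling: split a video of `num_frames` frames
    into `num_segments` equal segments and pick one frame index from each."""
    seg_len = num_frames // num_segments
    if seg_len == 0:
        # Video shorter than the number of segments: clamp to the last frame.
        return [min(i, num_frames - 1) for i in range(num_segments)]
    indices = []
    for i in range(num_segments):
        start = i * seg_len
        if train:
            # Random frame within the segment (data augmentation).
            indices.append(start + random.randrange(seg_len))
        else:
            # Deterministic center frame for evaluation.
            indices.append(start + seg_len // 2)
    return indices
```

The sampled frames are then passed through a shared 2D CNN and the per-segment predictions are aggregated (e.g. averaged) into a video-level score.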