Action recognition involves identifying actions in videos, where the action may or may not be performed throughout the entire duration of the video. Classifying actions (jumping, picking, playing, etc.) in still images can be done with high accuracy using deep convolutional networks. For action recognition in videos, however, existing work is limited and its accuracy is comparatively poor. Unlike image classification, action recognition from videos requires capturing context across the whole video rather than extracting information from each frame independently.
We aim to extend the deep convolutional neural network models used for classifying actions in still images. Prior work that feeds stacked frames directly into a CNN already exists, but its results are poor. To improve on this, we implement an architecture for action recognition in videos that is inspired by the human visual cortex, which uses two streams to detect and recognise motion:
- Ventral Stream: used for performing object recognition
- Dorsal Stream: used for recognising motion

The authors' architecture involves two separate recognition streams: spatial and temporal. The spatial stream performs action recognition on still frames extracted from the video, while the temporal stream recognises actions from motion using dense optical flow.
A video can be decomposed into two components, spatial and temporal. Each of the two streams is implemented as a deep CNN:
- Spatial Stream ConvNet: the spatial stream carries information about the objects and the scene in the video, in RGB format. This network operates on individual frames extracted from the video at some sampling rate.
- Temporal Stream ConvNet: the temporal stream carries information about the movement of the object (or the camera) in the scene (refer Fig. 2), i.e. the direction of motion. The input fed to this model is formed by stacking multiple optical flow displacement images computed between consecutive video frames (a sketch of this stacking appears below).
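A minimal sketch of how such a stacked optical-flow input could be built. The choice of OpenCV's Farneback dense flow, the stack length `L`, and the function name are our assumptions for illustration, not details fixed by the project:

```python
import cv2
import numpy as np

def stack_optical_flow(frames, L=10):
    """Stack horizontal and vertical dense optical flow between L+1
    consecutive BGR frames into a single (H, W, 2*L) input array."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[:L + 1]]
    channels = []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        # Farneback dense optical flow: per-pixel (u, v) displacement field
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])  # horizontal displacement (u)
        channels.append(flow[..., 1])  # vertical displacement (v)
    return np.stack(channels, axis=-1)
```

The resulting 2*L-channel array is what the temporal stream consumes in place of a 3-channel RGB image.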
We use the well-known UCF101 dataset of human actions. UCF101 has 101 classes, which are divided into five types:
- Human-Object Interaction
- Body-Motion Only
- Human-Human Interaction
- Playing Musical Instruments
- Sports

The dataset consists of realistic user-uploaded videos containing cluttered backgrounds and camera motion. It offers the largest diversity in terms of actions, with large variations in camera motion, cluttered background, and illumination conditions. For our project we created our own UCF10 dataset, a subset extracted from UCF101 with 10 classes, 200 videos per class, and 2K videos in total. The train and test data lists are present in the Data folder.
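As a rough sketch of how those split lists could be consumed. The file names under the Data folder and the line format (relative video path followed by an integer label) are assumptions here:

```python
from pathlib import Path

def load_split(list_file):
    """Read a split list; each line is assumed to contain a relative video
    path followed by an integer class label (the format is an assumption)."""
    samples = []
    for line in Path(list_file).read_text().splitlines():
        if line.strip():
            path, label = line.rsplit(maxsplit=1)
            samples.append((path, int(label)))
    return samples

# Hypothetical file names under the Data folder
train_samples = load_split("Data/train_list.txt")
test_samples = load_split("Data/test_list.txt")
```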
The motivation is to capture both the spatial features of the video and the temporal ones. For example, to decide whether the person in a video is doing archery or playing basketball, the spatial network captures still-frame information about the action being performed; essentially, it performs classification on individual frames of the video. The temporal network tries to distinguish the action using the motion in the video. For this motion we use segmentation-based optical flow in both the horizontal and vertical directions. Both models use ResNet34 as the underlying network.
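A minimal sketch of how both streams could be built on ResNet34, assuming torchvision; widening the first convolution to 2*L flow channels and averaging the two streams' scores are illustrative choices, not details fixed by the text:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

NUM_CLASSES = 10   # our UCF10 subset
FLOW_L = 10        # assumed number of stacked flow frame pairs

def make_spatial_stream(num_classes=NUM_CLASSES):
    """Spatial stream: plain ResNet34 over a single RGB frame."""
    net = resnet34()
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

def make_temporal_stream(num_classes=NUM_CLASSES, flow_channels=2 * FLOW_L):
    """Temporal stream: ResNet34 whose first convolution is widened to
    accept 2*L flow channels (horizontal + vertical per frame pair)."""
    net = resnet34()
    net.conv1 = nn.Conv2d(flow_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# Late fusion of the two streams' class scores (averaging is one simple choice)
spatial, temporal = make_spatial_stream(), make_temporal_stream()
rgb = torch.randn(1, 3, 224, 224)            # one sampled frame
flow = torch.randn(1, 2 * FLOW_L, 224, 224)  # stacked optical flow input
scores = (spatial(rgb).softmax(-1) + temporal(flow).softmax(-1)) / 2
```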