This project focuses on building a video classifier using the UCF101 dataset, a comprehensive collection of videos categorized into actions like cricket shots, punching, biking, etc.
Understanding the Dataset: The UCF101 dataset is widely recognized for building action recognizers, a key application of video classification. It comprises videos, each an ordered sequence of frames. These frames carry spatial information, while their sequencing imparts temporal information.
Architectural Approach: To effectively model both spatial and temporal aspects, a hybrid architecture combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) is employed. This architecture, known as CNN-RNN or CNN-LSTM, integrates the spatial processing strengths of CNNs with the temporal processing capabilities of RNNs, specifically using GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) layers.
- TensorFlow 2.5 or higher is required.
- A subsampled version of the UCF101 dataset is utilized in this project. The process of subsampling is detailed in this notebook, which provides insights into how the dataset was prepared for this specific application.
This project implements two distinct models for video classification. The first model utilizes a pre-trained InceptionV3 (trained on ImageNet) as the CNN-based spatial feature extractor combined with a GRU layer for temporal feature extraction. The second model employs EfficientNetB7 as the CNN component and an LSTM layer for processing temporal features. These models are employed to capture and learn from both spatial and temporal aspects of the video data.
Specific parameters are set for image processing, model training, and sequence processing of video frames. These parameters govern how the model processes and learns from the data.
- `IMAGE_DIMENSION`: 600 (EfficientNetB7) / 224 (InceptionV3). The target size for resizing images, used to standardize input image dimensions for consistent processing. A different size is used for each model.
- `BATCH_SIZE`: 64. The number of samples propagated through the network per training step.
- `TRAINING_EPOCHS`: 60. The number of complete passes through the training dataset. The models are trained for up to 60 epochs, with early stopping to prevent overfitting if the validation loss does not improve for 15 epochs.
- `SEQUENCE_LENGTH`: 20. The maximum length of the frame sequence per video, ensuring a uniform temporal dimension across all videos.
- `FEATURE_VECTOR_SIZE`: 2560 (EfficientNetB7) / 2048 (InceptionV3). The number of features extracted from each frame, crucial for capturing the information needed for successful classification. A different size is used for each model.
These configuration parameters play a pivotal role in the models' ability to learn from the video data and accurately classify actions, optimizing performance while balancing computational efficiency.
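For reference, a minimal sketch of how these settings might be declared as constants (values mirror the list above; the InceptionV3 variant is shown, and the exact variable names are illustrative):

```python
# Illustrative configuration for the InceptionV3 variant; the EfficientNetB7 variant
# would instead use IMAGE_DIMENSION = 600 and FEATURE_VECTOR_SIZE = 2560.
IMAGE_DIMENSION = 224        # target size (height and width) for resized frames
BATCH_SIZE = 64              # samples propagated through the network per step
TRAINING_EPOCHS = 60         # maximum passes over the training data
SEQUENCE_LENGTH = 20         # maximum number of frames kept per video
FEATURE_VECTOR_SIZE = 2048   # per-frame feature size produced by the backbone
```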
One of the primary challenges in training video classifiers is devising a method to efficiently feed videos into a neural network. Various strategies exist, as discussed in this blog post. Given that a video is essentially an ordered sequence of frames, a straightforward approach might be to extract these frames and form a 3D tensor. However, due to varying frame counts across different videos, this method can be problematic for batching unless padding is used.
In this project, we adopt a method similar to that used in text sequence problems. Our approach involves:
- Capturing frames from each video.
- Extracting frames until a maximum frame count is reached.
- Padding videos with fewer frames than the maximum with zeros.
This method is particularly suitable for the UCF101 dataset, where there isn't significant variation in objects and actions across frames. However, it's important to note that this approach might not generalize well to other video classification tasks. We use OpenCV's `VideoCapture()` to read frames from videos.
The following functions are adapted from a TensorFlow tutorial on action recognition:
This function crops the center square of a given frame.
- `input_frame`: The frame to be cropped.
- Returns: Cropped frame.
The function calculates the center of the frame and crops it to form a square, ensuring uniform frame dimensions.
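A minimal sketch of this center-crop helper, adapted from the idea in the TensorFlow action-recognition tutorial (the function name is assumed here for illustration):

```python
def crop_center_square(input_frame):
    # Crop the largest possible square from the center of the frame.
    height, width = input_frame.shape[0:2]
    min_dim = min(height, width)
    start_x = (width // 2) - (min_dim // 2)
    start_y = (height // 2) - (min_dim // 2)
    return input_frame[start_y : start_y + min_dim, start_x : start_x + min_dim]
```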
Processes and extracts frames from a video.
- `video_path`: Path to the video file.
- `max_frame_count`: Maximum number of frames to extract.
- `target_size`: The dimensions to which each frame is resized.
The function reads the video, applies center cropping to each frame, resizes them, and reorders color channels. It then returns the processed frames as an array, adhering to the specified maximum frame count.
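A sketch of this loading routine, assuming the center-crop helper and the configuration constants introduced above (the name `load_video` is hypothetical):

```python
import cv2
import numpy as np

def load_video(video_path, max_frame_count=SEQUENCE_LENGTH,
               target_size=(IMAGE_DIMENSION, IMAGE_DIMENSION)):
    capture = cv2.VideoCapture(video_path)
    frames = []
    try:
        while True:
            frame_read, frame = capture.read()
            if not frame_read:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, target_size)
            frame = frame[:, :, [2, 1, 0]]  # reorder channels from BGR (OpenCV) to RGB
            frames.append(frame)
            if len(frames) == max_frame_count:
                break
    finally:
        capture.release()
    return np.array(frames)
```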
To extract features from the frames of each video, leveraging a pre-trained network is a highly effective approach. The `keras.applications` module offers several state-of-the-art models pre-trained on the ImageNet-1k dataset. For this project, we specifically utilize InceptionV3 and EfficientNetB7, known for their efficiency and accuracy in image classification tasks.
The InceptionV3 and EfficientNetB7 models, pre-trained on ImageNet, are utilized to extract features from the video frames.
This function builds a feature extraction model using the InceptionV3 and EfficientNetB7 architectures.
- Returns: A Keras model specifically designed for feature extraction.
This setup results in a robust feature extraction model that can be applied to each frame of the videos.
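A sketch of the extractor for the InceptionV3 variant; the EfficientNetB7 variant would swap in `keras.applications.EfficientNetB7` with its matching `preprocess_input` and `IMAGE_DIMENSION = 600`:

```python
from tensorflow import keras

def build_feature_extractor():
    # InceptionV3 pre-trained on ImageNet, with global average pooling so each
    # frame is reduced to a single feature vector.
    backbone = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMAGE_DIMENSION, IMAGE_DIMENSION, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMAGE_DIMENSION, IMAGE_DIMENSION, 3))
    outputs = backbone(preprocess_input(inputs))
    return keras.Model(inputs, outputs, name="feature_extractor")

feature_extractor = build_feature_extractor()
```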
An important step in preparing the dataset for training involves converting the string labels of each video into numeric indices. This conversion enables the model to process and learn from these labels effectively.
We implement a label processor using Keras' `StringLookup` layer, which converts string labels into their corresponding numeric indices.
- `num_oov_indices`: Set to 0, indicating the number of out-of-vocabulary indices.
- `vocabulary`: The unique tags obtained from the training data.

This creates a consistent mapping from string labels to numeric indices.
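A sketch of the label processor, assuming the training labels live in a column named `"tag"` of a hypothetical `train_df` dataframe:

```python
import numpy as np
from tensorflow import keras

# Map string labels to integer indices; no out-of-vocabulary slots are reserved.
label_processor = keras.layers.StringLookup(
    num_oov_indices=0,
    vocabulary=np.unique(train_df["tag"]),
)
print(label_processor.get_vocabulary())  # the list of unique action tags
```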
To prepare the videos for the neural network, we need to extract features from each frame and create masks to handle varying video lengths. This process is essential for transforming raw video data into a format suitable for model training.
This function prepares all videos in a given dataframe by extracting features and creating masks.
- `dataframe`: Contains information about the videos, such as names and labels.
- `directory_path`: The root directory where the videos are stored.
The function processes each video, extracts frame features using the previously defined feature extraction model, and creates masks to account for videos with frame counts less than the maximum sequence length.
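A sketch of this preparation step, assuming the constants, `load_video()`, `feature_extractor`, and `label_processor` introduced above, plus hypothetical dataframe columns `"video_name"` and `"tag"`:

```python
import os

def prepare_all_videos(dataframe, directory_path):
    num_samples = len(dataframe)
    video_paths = dataframe["video_name"].values.tolist()
    labels = label_processor(dataframe["tag"].values[..., None]).numpy()

    # Boolean masks mark which time steps hold real frames (True) vs. zero padding (False).
    frame_masks = np.zeros((num_samples, SEQUENCE_LENGTH), dtype="bool")
    frame_features = np.zeros(
        (num_samples, SEQUENCE_LENGTH, FEATURE_VECTOR_SIZE), dtype="float32"
    )

    for idx, path in enumerate(video_paths):
        frames = load_video(os.path.join(directory_path, path))
        length = min(SEQUENCE_LENGTH, len(frames))
        if length > 0:
            frame_features[idx, :length] = feature_extractor.predict(
                frames[:length], verbose=0
            )
        frame_masks[idx, :length] = True

    return (frame_features, frame_masks), labels
```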
- Input Layers: There are two input layers, one for the frame features and another for the sequence mask.
- GRU and LSTM Layers: The GRU and LSTM layers process the sequence of frame features while taking the sequence mask into account. This helps the model focus on the relevant parts of the video.
- Output Layer: The final layer is a dense layer with a softmax activation function, corresponding to the number of unique tags (classes) in the dataset.
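A sketch of the GRU variant of this sequence model; the layer widths are illustrative rather than the project's exact values, and the LSTM variant would replace the GRU layers with `keras.layers.LSTM`:

```python
def build_sequence_model(num_classes):
    # Two inputs: per-frame features and the boolean mask marking real frames.
    frame_features_input = keras.Input((SEQUENCE_LENGTH, FEATURE_VECTOR_SIZE))
    mask_input = keras.Input((SEQUENCE_LENGTH,), dtype="bool")

    x = keras.layers.GRU(16, return_sequences=True)(frame_features_input, mask=mask_input)
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model([frame_features_input, mask_input], outputs)
    model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return model
```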
In this phase, we focus on training the RNN sequence model and evaluating its performance on the test dataset. The process uses callbacks for efficient training and for preserving the best-performing model weights. The `conduct_experiment()` function encapsulates the entire process of training and evaluating the model.
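A sketch of what `conduct_experiment()` might look like, assuming the helpers above; the checkpoint path and validation split are illustrative:

```python
def conduct_experiment(train_data, train_labels, test_data, test_labels):
    checkpoint_path = "video_classifier.weights.h5"  # hypothetical checkpoint location
    callbacks = [
        keras.callbacks.ModelCheckpoint(
            checkpoint_path, save_weights_only=True, save_best_only=True, verbose=1
        ),
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=15),
    ]

    model = build_sequence_model(num_classes=len(label_processor.get_vocabulary()))
    history = model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=TRAINING_EPOCHS,
        batch_size=BATCH_SIZE,
        callbacks=callbacks,
    )

    # Restore the best checkpoint before evaluating on the held-out test set.
    model.load_weights(checkpoint_path)
    _, accuracy = model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {accuracy * 100:.2f}%")
    return model, history
```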
The final aspect of the project involves predicting actions in videos and visualizing the results. This process includes preparing the video frames for prediction, performing the sequence prediction, and converting the video frames to a GIF for an easy-to-understand visual representation.
Prepares a single video's frames for prediction by the sequence model.
- `video_frames`: Frames of the video to be processed.
- Returns: Processed frame features and frame mask.
The function processes each frame, extracts the features, and creates a mask to handle videos with fewer frames than the maximum sequence length.
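A sketch of this single-video preparation, mirroring the batched `prepare_all_videos()` sketch above:

```python
def prepare_single_video(video_frames):
    frame_mask = np.zeros((1, SEQUENCE_LENGTH), dtype="bool")
    frame_features = np.zeros((1, SEQUENCE_LENGTH, FEATURE_VECTOR_SIZE), dtype="float32")

    length = min(SEQUENCE_LENGTH, len(video_frames))
    if length > 0:
        frame_features[0, :length] = feature_extractor.predict(
            video_frames[:length], verbose=0
        )
    frame_mask[0, :length] = True

    return frame_features, frame_mask
```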
Performs sequence prediction on a given video.
- `video_path`: Path to the video file.
- Returns: Frames of the video.
The function predicts the probability of each class for the given video and prints the predictions.
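A sketch of the prediction routine, assuming a trained model named `sequence_model` (for example, the model returned by `conduct_experiment()`):

```python
def sequence_prediction(video_path):
    class_vocab = label_processor.get_vocabulary()

    video_frames = load_video(video_path)
    frame_features, frame_mask = prepare_single_video(video_frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    # Print the classes sorted from most to least probable.
    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return video_frames
```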
Converts a sequence of video frames into a GIF.
- `video_frames`: Frames of the video.
- Returns: IPython Image display object of the created GIF.
This utility function is useful for visualizing the video frames in a more engaging and understandable format.
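A sketch of this GIF helper, assuming the `imageio` package is available; the output filename is illustrative, and playback speed could be tuned via the GIF writer's duration option:

```python
import imageio
from IPython.display import Image

def to_gif(video_frames):
    # GIF frames must be 8-bit; clip and cast the raw frames before writing.
    converted_frames = np.clip(video_frames, 0, 255).astype(np.uint8)
    imageio.mimsave("animation.gif", list(converted_frames))
    return Image("animation.gif")
```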
While the current project establishes a solid foundation for video classification, there are several avenues for future enhancements and experiments to further improve performance and adaptability.
- Fine-Tuning: Experiment with fine-tuning the pre-trained CNN-based networks (like InceptionV3 or EfficientNetB7) used for feature extraction. Adjusting these networks specifically for your dataset can potentially improve results.
- Speed-Accuracy Trade-offs: Investigate other models within `keras.applications` to balance speed and accuracy. Each model offers different benefits and compromises.
- Sequence Length Variations: Experiment with different values for `SEQUENCE_LENGTH` and observe how altering the maximum sequence length affects performance.
- Training on More Classes: Expand the number of classes in the training dataset to challenge the model's ability to generalize and handle more diverse data.
- Pre-Trained Action Recognition Models: Utilize pre-trained action recognition models like those from DeepMind, as detailed in this TensorFlow tutorial.
- Rolling-Averaging Technique: Implement rolling-averaging with standard image classification models for video classification. This tutorial provides insights into using this technique.
- Self-Attention for Frame Importance: In scenarios with significant variations between frames, incorporating a self-attention layer in the sequence model can help focus on the most relevant frames for classification.
- Transformers for Video Processing: Explore the implementation of Transformer-based models for processing videos, as explained in this book chapter. Transformers can offer significant advantages in understanding the complex temporal dynamics in videos.
- Augmentation Techniques: Implement data augmentation techniques to increase the diversity of the training dataset, which can lead to better generalization and robustness of the model.
A heartfelt thank you to Sayak Paul for his invaluable contribution and insight.