This project focuses on building a video classifier using the UCF101 dataset, a comprehensive collection of videos categorized into actions like cricket shots, punching, biking, etc.
Understanding the Dataset: The UCF101 dataset is widely recognized for building action recognizers, a key application of video classification. It comprises videos, each an ordered sequence of frames. These frames carry spatial information, while their sequencing imparts temporal information.
Architectural Approach: To effectively model both spatial and temporal aspects, a hybrid architecture combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) is employed. This architecture, known as CNN-RNN or CNN-LSTM, integrates the spatial processing strengths of CNNs with the temporal processing capabilities of RNNs, specifically using GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) layers.
- TensorFlow 2.5 or higher is required.
- A subsampled version of the UCF101 dataset is utilized in this project. The process of subsampling is detailed in this notebook, which provides insights into how the dataset was prepared for this specific application.
This project implements two distinct models for video classification. The first model utilizes a pre-trained InceptionV3 (trained on ImageNet) as the CNN-based spatial feature extractor combined with a GRU layer for temporal feature extraction. The second model employs EfficientNetB7 as the CNN component and an LSTM layer for processing temporal features. These models are employed to capture and learn from both spatial and temporal aspects of the video data.
Specific parameters are set for image processing, model training, and sequence processing of video frames. These parameters govern how the model processes and learns from the data.
- `IMAGE_DIMENSION`: 600 (EfficientNetB7) / 224 (InceptionV3). The target size for resizing images, used to standardize input image dimensions for consistent processing. A different size is used for each model.
- `BATCH_SIZE`: 64. The number of samples propagated through the network per training step.
- `TRAINING_EPOCHS`: 60. The number of complete passes through the training dataset. The models are trained for up to 60 epochs, with early stopping to prevent overfitting if the validation loss does not improve for 15 epochs.
- `SEQUENCE_LENGTH`: 20. The maximum length of the frame sequence per video, ensuring a uniform temporal dimension across all videos.
- `FEATURE_VECTOR_SIZE`: 2560 (EfficientNetB7) / 2048 (InceptionV3). The number of features extracted from each frame, crucial for capturing the information needed for successful classification. A different size is used for each model.
These configuration parameters play a pivotal role in the models' ability to learn from the video data and accurately classify actions, optimizing performance while balancing computational efficiency.
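For reference, a minimal sketch of how these settings might be declared as constants (values mirror the list above; the InceptionV3 variant is shown, and the exact variable names are illustrative):

```python
# Illustrative configuration for the InceptionV3 variant; the EfficientNetB7 variant
# would instead use IMAGE_DIMENSION = 600 and FEATURE_VECTOR_SIZE = 2560.
IMAGE_DIMENSION = 224        # target size (height and width) for resized frames
BATCH_SIZE = 64              # samples propagated through the network per step
TRAINING_EPOCHS = 60         # maximum passes over the training data
SEQUENCE_LENGTH = 20         # maximum number of frames kept per video
FEATURE_VECTOR_SIZE = 2048   # per-frame feature size produced by the backbone
```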
One of the primary challenges in training video classifiers is devising a method to efficiently feed videos into a neural network. Various strategies exist, as discussed in this blog post. Given that a video is essentially an ordered sequence of frames, a straightforward approach might be to extract these frames and form a 3D tensor. However, due to varying frame counts across different videos, this method can be problematic for batching unless padding is used.
In this project, we adopt a method similar to that used in text sequence problems. Our approach involves:
- Capturing frames from each video.
- Extracting frames until a maximum frame count is reached.
- Padding videos with fewer frames than the maximum with zeros.
This method is particularly suitable for the UCF101 dataset, where there isn't significant variation in objects and actions across frames. However, it's important to note that this approach might not generalize well to other video classification tasks. We use OpenCV's `VideoCapture()` to read frames from videos.
The following functions are adapted from a TensorFlow tutorial on action recognition:
This function crops the center square of a given frame.
- `input_frame`: The frame to be cropped.
- Returns: Cropped frame.
The function calculates the center of the frame and crops it to form a square, ensuring uniform frame dimensions.
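A minimal sketch of this center-crop helper, adapted from the idea in the TensorFlow action-recognition tutorial (the function name is assumed here for illustration):

```python
def crop_center_square(input_frame):
    # Crop the largest possible square from the center of the frame.
    height, width = input_frame.shape[0:2]
    min_dim = min(height, width)
    start_x = (width // 2) - (min_dim // 2)
    start_y = (height // 2) - (min_dim // 2)
    return input_frame[start_y : start_y + min_dim, start_x : start_x + min_dim]
```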
Processes and extracts frames from a video.
- `video_path`: Path to the video file.
- `max_frame_count`: Maximum number of frames to extract.
- `target_size`: The dimensions to which each frame is resized.
The function reads the video, applies center cropping to each frame, resizes them, and reorders color channels. It then returns the processed frames as an array, adhering to the specified maximum frame count.
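A sketch of this loading routine, assuming the center-crop helper and the configuration constants introduced above (the name `load_video` is hypothetical):

```python
import cv2
import numpy as np

def load_video(video_path, max_frame_count=SEQUENCE_LENGTH,
               target_size=(IMAGE_DIMENSION, IMAGE_DIMENSION)):
    capture = cv2.VideoCapture(video_path)
    frames = []
    try:
        while True:
            frame_read, frame = capture.read()
            if not frame_read:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, target_size)
            frame = frame[:, :, [2, 1, 0]]  # reorder channels from BGR (OpenCV) to RGB
            frames.append(frame)
            if len(frames) == max_frame_count:
                break
    finally:
        capture.release()
    return np.array(frames)
```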
To extract features from the frames of each video, leveraging a pre-trained network is a highly effective approach. The `keras.applications` module offers several state-of-the-art models pre-trained on the ImageNet-1k dataset. For this project, we specifically utilize InceptionV3 and EfficientNetB7, known for their efficiency and accuracy in image classification tasks.
The InceptionV3 and EfficientNetB7 models, pre-trained on ImageNet, are utilized to extract features from the video frames.
This function builds a feature extraction model using the InceptionV3 and EfficientNetB7 architectures.
- Returns: A Keras model specifically designed for feature extraction.
This setup results in a robust feature extraction model that can be applied to each frame of the videos.
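A sketch of the extractor for the InceptionV3 variant; the EfficientNetB7 variant would swap in `keras.applications.EfficientNetB7` with its matching `preprocess_input` and `IMAGE_DIMENSION = 600`:

```python
from tensorflow import keras

def build_feature_extractor():
    # InceptionV3 pre-trained on ImageNet, with global average pooling so each
    # frame is reduced to a single feature vector.
    backbone = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMAGE_DIMENSION, IMAGE_DIMENSION, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMAGE_DIMENSION, IMAGE_DIMENSION, 3))
    outputs = backbone(preprocess_input(inputs))
    return keras.Model(inputs, outputs, name="feature_extractor")

feature_extractor = build_feature_extractor()
```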
An important step in preparing the dataset for training involves converting the string labels of each video into numeric indices. This conversion enables the model to process and learn from these labels effectively.
We implement a label processor using Keras' `StringLookup` layer, which converts string labels into their corresponding numeric indices.
- `num_oov_indices`: Set to 0, indicating the number of out-of-vocabulary indices.
- `vocabulary`: The unique tags obtained from the training data.

This creates a consistent mapping from string labels to numeric indices.
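A sketch of the label processor, assuming the training labels live in a column named `"tag"` of a hypothetical `train_df` dataframe:

```python
import numpy as np
from tensorflow import keras

# Map string labels to integer indices; no out-of-vocabulary slots are reserved.
label_processor = keras.layers.StringLookup(
    num_oov_indices=0,
    vocabulary=np.unique(train_df["tag"]),
)
print(label_processor.get_vocabulary())  # the list of unique action tags
```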
To prepare the videos for the neural network, we need to extract features from each frame and create masks to handle varying video lengths. This process is essential for transforming raw video data into a format suitable for model training.
This function prepares all videos in a given dataframe by extracting features and creating masks.
- `dataframe`: Contains information about the videos, such as names and labels.
- `directory_path`: The root directory where the videos are stored.
The function processes each video, extracts frame features using the previously defined feature extraction model, and creates masks to account for videos with frame counts less than the maximum sequence length.
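A sketch of this preparation step, assuming the constants, `load_video()`, `feature_extractor`, and `label_processor` introduced above, plus hypothetical dataframe columns `"video_name"` and `"tag"`:

```python
import os

def prepare_all_videos(dataframe, directory_path):
    num_samples = len(dataframe)
    video_paths = dataframe["video_name"].values.tolist()
    labels = label_processor(dataframe["tag"].values[..., None]).numpy()

    # Boolean masks mark which time steps hold real frames (True) vs. zero padding (False).
    frame_masks = np.zeros((num_samples, SEQUENCE_LENGTH), dtype="bool")
    frame_features = np.zeros(
        (num_samples, SEQUENCE_LENGTH, FEATURE_VECTOR_SIZE), dtype="float32"
    )

    for idx, path in enumerate(video_paths):
        frames = load_video(os.path.join(directory_path, path))
        length = min(SEQUENCE_LENGTH, len(frames))
        if length > 0:
            frame_features[idx, :length] = feature_extractor.predict(
                frames[:length], verbose=0
            )
        frame_masks[idx, :length] = True

    return (frame_features, frame_masks), labels
```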
- Input Layers: There are two input layers, one for the frame features and another for the sequence mask.
- GRU and LSTM Layers: The GRU and LSTM layers process the sequence of frame features while taking the sequence mask into account. This helps the model focus on the relevant parts of the video.
- Output Layer: The final layer is a dense layer with a softmax activation function, corresponding to the number of unique tags (classes) in the dataset.
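A sketch of the GRU variant of this sequence model; the layer widths are illustrative rather than the project's exact values, and the LSTM variant would replace the GRU layers with `keras.layers.LSTM`:

```python
def build_sequence_model(num_classes):
    # Two inputs: per-frame features and the boolean mask marking real frames.
    frame_features_input = keras.Input((SEQUENCE_LENGTH, FEATURE_VECTOR_SIZE))
    mask_input = keras.Input((SEQUENCE_LENGTH,), dtype="bool")

    x = keras.layers.GRU(16, return_sequences=True)(frame_features_input, mask=mask_input)
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    outputs = keras.layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model([frame_features_input, mask_input], outputs)
    model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return model
```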
In this phase, we focus on training the RNN sequence model and evaluating its performance on the test dataset. The process uses callbacks for efficient training and for preserving the best-performing model weights. The `conduct_experiment()` function encapsulates the entire process of training and evaluating the model.
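A sketch of what `conduct_experiment()` might look like, assuming the helpers above; the checkpoint path and validation split are illustrative:

```python
def conduct_experiment(train_data, train_labels, test_data, test_labels):
    checkpoint_path = "video_classifier.weights.h5"  # hypothetical checkpoint location
    callbacks = [
        keras.callbacks.ModelCheckpoint(
            checkpoint_path, save_weights_only=True, save_best_only=True, verbose=1
        ),
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=15),
    ]

    model = build_sequence_model(num_classes=len(label_processor.get_vocabulary()))
    history = model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=TRAINING_EPOCHS,
        batch_size=BATCH_SIZE,
        callbacks=callbacks,
    )

    # Restore the best checkpoint before evaluating on the held-out test set.
    model.load_weights(checkpoint_path)
    _, accuracy = model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {accuracy * 100:.2f}%")
    return model, history
```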
The final aspect of the project involves predicting actions in videos and visualizing the results. This process includes preparing the video frames for prediction, performing the sequence prediction, and converting the video frames to a GIF for an easy-to-understand visual representation.
Prepares a single video's frames for prediction by the sequence model.
- `video_frames`: Frames of the video to be processed.
- Returns: Processed frame features and frame mask.
The function processes each frame, extracts the features, and creates a mask to handle videos with fewer frames than the maximum sequence length.
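A sketch of this single-video preparation, mirroring the batched `prepare_all_videos()` sketch above:

```python
def prepare_single_video(video_frames):
    frame_mask = np.zeros((1, SEQUENCE_LENGTH), dtype="bool")
    frame_features = np.zeros((1, SEQUENCE_LENGTH, FEATURE_VECTOR_SIZE), dtype="float32")

    length = min(SEQUENCE_LENGTH, len(video_frames))
    if length > 0:
        frame_features[0, :length] = feature_extractor.predict(
            video_frames[:length], verbose=0
        )
    frame_mask[0, :length] = True

    return frame_features, frame_mask
```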
Performs sequence prediction on a given video.
- `video_path`: Path to the video file.
- Returns: Frames of the video.
The function predicts the probability of each class for the given video and prints the predictions.
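A sketch of the prediction routine, assuming a trained model named `sequence_model` (for example, the model returned by `conduct_experiment()`):

```python
def sequence_prediction(video_path):
    class_vocab = label_processor.get_vocabulary()

    video_frames = load_video(video_path)
    frame_features, frame_mask = prepare_single_video(video_frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    # Print the classes sorted from most to least probable.
    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return video_frames
```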
Converts a sequence of video frames into a GIF.
- `video_frames`: Frames of the video.
- Returns: IPython Image display object of the created GIF.
This utility function is useful for visualizing the video frames in a more engaging and understandable format.
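A sketch of this GIF helper, assuming the `imageio` package is available; the output filename is illustrative, and playback speed could be tuned via the GIF writer's duration option:

```python
import imageio
from IPython.display import Image

def to_gif(video_frames):
    # GIF frames must be 8-bit; clip and cast the raw frames before writing.
    converted_frames = np.clip(video_frames, 0, 255).astype(np.uint8)
    imageio.mimsave("animation.gif", list(converted_frames))
    return Image("animation.gif")
```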
While the current project establishes a solid foundation for video classification, there are several avenues for future enhancements and experiments to further improve performance and adaptability.
- Fine-Tuning: Experiment with fine-tuning the pre-trained CNN-based networks (like InceptionV3 or EfficientNetB7) used for feature extraction. Adjusting these networks specifically for your dataset can potentially improve results.
- Speed-Accuracy Trade-offs: Investigate other models within `keras.applications` to balance speed and accuracy. Each model offers different benefits and compromises.
- Sequence Length Variations: Experiment with different values for `SEQUENCE_LENGTH` and observe how altering the maximum sequence length affects performance.
- Training on More Classes: Expand the number of classes in the training dataset to challenge the model's ability to generalize and handle more diverse data.
- Pre-Trained Action Recognition Models: Utilize pre-trained action recognition models like those from DeepMind, as detailed in this TensorFlow tutorial.
- Rolling-Averaging Technique: Implement rolling-averaging with standard image classification models for video classification. This tutorial provides insights into using this technique.
- Self-Attention for Frame Importance: In scenarios with significant variations between frames, incorporating a self-attention layer in the sequence model can help focus on the most relevant frames for classification.
- Transformers for Video Processing: Explore the implementation of Transformer-based models for processing videos, as explained in this book chapter. Transformers can offer significant advantages in understanding the complex temporal dynamics in videos.
- Augmentation Techniques: Implement data augmentation techniques to increase the diversity of the training dataset, which can lead to better generalization and robustness of the model.
A heartfelt thank you to Sayak Paul for his invaluable contribution and insight.