
HEART-MET Assistive Robot Challenge with VideoMAE Transformers

Olmer Garcia-Bedoya, Jose Tomas Lorente, Sebastian Murcia

Ekumen Inc

Introduction

This report presents the methodology and some of the lessons learned during the ICSR 2022 HEART-MET Assistive Robot Challenge. The results of the challenge can be seen on CodaLab.

The task in the challenge is to recognize human activities from videos. The videos are recorded by robots operating in a domestic environment and include activities such as reading a book, drinking water, or falling on the floor. HEART-MET is one of the competitions in the METRICS project, which has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 871252. The competition aims to benchmark assistive robots performing healthcare-related tasks in unstructured domestic environments.

Ekumen Inc is an international engineering boutique and a provider of advanced software development services and technology. We specialize in bridging the gap between scientific research and deployable software products, with experience in open source projects such as ROS. Our main areas of work are Robotics Software Applications, Web and Mobile Technology, Embedded Systems, and Augmented Reality Applications.

Self-supervised learning first achieved large-scale success in natural language processing (NLP). The solutions, based on autoregressive language modeling in GPT and masked autoencoding in BERT, are conceptually simple: they remove a portion of the data and learn to predict the removed content. The application of transformers to images started with An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT), which outperformed the state-of-the-art CNNs of the time by almost 4x in terms of computational efficiency and accuracy [1], encoding each image as a sequence of fixed-size patches keyed by their position in the image. Building on this concept, Facebook introduced Masked Autoencoders Are Scalable Vision Learners, which is the basis of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, the model used during this challenge. The next section presents some insight into the methodology.

Methodology

VideoMAE is a transformer model for video classification. Like other transformers, it is trained in two stages. The first stage, called pre-training, uses an encoder and a decoder and requires a lot of computational power. The input pipeline masks random cubes taken from a clip of 16 frames, where each frame of a 224x224 video is split into 16x16 fixed-size patches. The decoder is trained to reconstruct the complete video, with the idea of creating "challenging self-supervisory tasks that require holistic understanding". The second stage, called fine-tuning, consists of a dense layer whose input is the output of the encoder and whose output size is the number of classification classes for the video.
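
As a rough illustration of the masking step described above, the sketch below counts patch tokens for a 16-frame 224x224 clip and hides a random fraction of them. The 90% masking ratio and the per-frame patch grid are assumptions for illustration; the actual VideoMAE pipeline additionally groups frames into temporal tubes.

```python
import torch

# Rough sketch of the masked pre-training input described above.
# Shapes and the masking ratio are assumptions, not the exact VideoMAE code.
frames, height, width, patch = 16, 224, 224, 16
clip = torch.randn(frames, 3, height, width)              # one 16-frame clip

patches_per_frame = (height // patch) * (width // patch)  # 14 * 14 = 196
num_tokens = frames * patches_per_frame                   # 3136 patch tokens

mask_ratio = 0.9                                          # assumed masking ratio
mask = torch.rand(num_tokens) < mask_ratio                # True = hidden from the encoder
visible = num_tokens - int(mask.sum())                    # only these tokens reach the encoder

# The decoder is then asked to reconstruct the full clip from the visible tokens.
print(f"{num_tokens} tokens, ~{visible} visible after masking")
```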

In Hugging Face, VideoMAE is a PyTorch torch.nn.Module subclass. The Hugging Face data and model come from the authors' repository, but in our case we start from the fine-tuned model published on Hugging Face by the Multimedia Computing Group of Nanjing University. Specifically, we fine-tune the last layer of the model videomae-base-finetuned-kinetics.
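
A minimal sketch of such a fine-tuning setup with the Hugging Face transformers API is shown below. The number of challenge classes is a placeholder, and freezing everything except the classification head is one possible reading of "fine-tuning the last layer", not necessarily the exact training script used here.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Start from the Kinetics-400 fine-tuned checkpoint and swap the classification head.
checkpoint = "MCG-NJU/videomae-base-finetuned-kinetics"
num_challenge_classes = 20  # hypothetical number of HEART-MET activity classes

processor = VideoMAEImageProcessor.from_pretrained(checkpoint)
model = VideoMAEForVideoClassification.from_pretrained(
    checkpoint,
    num_labels=num_challenge_classes,
    ignore_mismatched_sizes=True,  # re-initialize the final dense layer
)

# Freeze the encoder so only the last (classification) layer is trained.
for param in model.videomae.parameters():
    param.requires_grad = False

# Dummy clip: 16 RGB frames of 224x224.
video = list(np.random.randint(0, 256, (16, 3, 224, 224), dtype=np.uint8))
inputs = processor(video, return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, num_challenge_classes)
```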

Conclusion

The most interesting characteristic of the VideoMAE approach was the speed of the fine-tuning stage: with fewer than 30 epochs (around 50 minutes in total on an NVIDIA 3090 Ti) we reached a score of 0.68 on the validation dataset. This contrasts with the base model, for which a single epoch takes roughly that same amount of time. Although we tested different approaches to improve the results, we did not find a better solution; we think the main problem comes from the data, which is unbalanced and in which many videos could be assigned to several classes. Adding some sort of memory unit to the model, or additional information such as optical flow to convey more about the video than only 16 frames, could also have improved the result.

We also tested X-CLIP (X-CLIP HuggingFace, X-CLIP GitHub), which gave 36% accuracy on the training dataset without any training. We think this could be an interesting approach if the category descriptions were enriched by taking into account the Kinetics classes, because it can provide an initial classification for labeling the dataset.
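
For reference, a zero-shot X-CLIP evaluation along those lines can be sketched as follows. The checkpoint name and the textual class descriptions below are assumptions for illustration, not the official challenge labels.

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# Score a clip against textual class descriptions with no training at all.
checkpoint = "microsoft/xclip-base-patch32"   # assumed checkpoint, expects 8 frames
processor = XCLIPProcessor.from_pretrained(checkpoint)
model = XCLIPModel.from_pretrained(checkpoint)

labels = [
    "a person reading a book",
    "a person drinking water",
    "a person falling on the floor",
]
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))  # 8 dummy frames

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```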

Next steps, after trying to balance the data, include starting the fine-tuning from the large Kinetics-400 model or from another available pre-trained model (Something-Something V2 or AVA 2.2), and changing from half precision to double precision.
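
One simple way to attempt the data balancing mentioned above is a weighted sampler, sketched below under the assumption that the integer label of every training clip is available in a list. This is a generic PyTorch recipe, not something taken from the challenge code.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-clip labels for an unbalanced training set.
labels = torch.tensor([0, 0, 0, 0, 1, 2, 2])

class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]   # rarer classes get sampled more often

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# Passing sampler=sampler to a DataLoader then draws a roughly balanced
# stream of training clips without duplicating files on disk.
```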

[1] https://viso.ai/deep-learning/vision-transformer-vit/

Licensing

This code is licensed under the Apache License, Version 2.0.

Copyright 2022- Ekumen Inc. All rights reserved.
