L2-07: SOTA Foundation Vision Models Benchmarking for Visual Recognition

Overview

This project benchmarks State-Of-The-Art (SOTA) Foundation Vision Models for a variety of visual recognition tasks, including image classification, object detection, and semantic segmentation.

Core Visual Recognition Tasks

| Core Visual Recognition Tasks | Description | Examples |
| --- | --- | --- |
| Image Classification | Assigns a label to an image. | Classifying an image as "cat" or "dog." |
| Object Detection | Detects and localizes objects in an image. | Detecting cars and pedestrians in a street scene. |
| Semantic Segmentation | Classifies each pixel into a category. | Separating road, sky, and pedestrians in an image. |
| Instance Segmentation | Identifies individual instances of objects and their masks. | Labeling each pedestrian in a crowd separately. |
| Image Captioning | Generates a textual description of an image. | "A dog playing in the park." |
| Action Recognition | Identifies actions in an image or video. | Recognizing someone is "running" or "jumping." |
| Pose Estimation | Estimates joint locations of humans or animals. | Detecting body pose in yoga poses. |
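
The three tasks named in the Overview are commonly scored with top-1 accuracy (classification), box intersection-over-union (detection), and mean IoU (segmentation). The sketch below shows one plain-Python way to compute them. The function names and input layouts (label lists, `(x_min, y_min, x_max, y_max)` boxes, 2D label grids) are assumptions made for illustration and are not taken from this project's code.

```python
def top1_accuracy(predicted, expected):
    """Fraction of images whose predicted label equals the ground-truth label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_mask, true_mask, num_classes):
    """Mean per-class IoU over two equally sized 2D label grids.

    Classes that appear in neither grid are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, true_row in zip(pred_mask, true_mask):
            for p, t in zip(pred_row, true_row):
                inter += (p == c) and (t == c)
                union += (p == c) or (t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

A detection benchmark would typically count a predicted box as correct when `box_iou` against a ground-truth box exceeds a chosen threshold. The threshold value is a benchmark design choice and is not specified here.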

Advanced Visual Recognition Tasks

| Advanced Visual Recognition Tasks | Description | Examples |
| --- | --- | --- |
| Image Segmentation (General) | Divides an image into meaningful regions with pixel-level accuracy. | Separating a cat from the background. |
| Depth Estimation | Predicts depth for each pixel in an image. | Estimating distances in a 3D scene. |
| 3D Reconstruction from Images | Reconstructs a 3D model from multiple images. | Building a 3D model of a building from photos. |
| OCR (Optical Character Recognition) | Recognizes and extracts text from images. | Reading a street sign in a photograph. |
| Image Super-Resolution | Enhances the resolution of an image. | Upscaling a low-resolution image to higher resolution. |
| Image Inpainting | Fills in missing or corrupted parts of an image. | Restoring damaged areas in an old photograph. |
| Image Style Transfer | Transfers the style of one image to another. | Applying Van Gogh’s painting style to a photo. |
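
Restoration tasks such as super-resolution and inpainting are often scored with peak signal-to-noise ratio (PSNR), and depth estimation with absolute relative error. The following is a minimal sketch of both, assuming images and depth maps are passed as flat lists of numbers. The function names and the convention of skipping pixels with non-positive ground-truth depth are illustrative assumptions, not this project's API.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def abs_rel_error(pred_depth, true_depth):
    """Mean of |pred - true| / true over pixels with positive ground-truth depth."""
    pairs = [(p, t) for p, t in zip(pred_depth, true_depth) if t > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground-truth depth")
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)
```

Higher PSNR means the restored image is closer to the reference, while lower absolute relative error means better depth predictions.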

Video-Based Visual Recognition Tasks

| Video-Based Visual Recognition Tasks | Description | Examples |
| --- | --- | --- |
| Video Classification | Classifies video sequences based on content. | Identifying a video as "sports" or "news." |
| Object Tracking | Continuously tracks objects across frames. | Following a car in a traffic video. |
| Video Action Recognition | Recognizes actions in a video sequence. | Identifying a soccer player "kicking a ball." |
| Video Segmentation | Performs segmentation across video frames. | Segmenting a moving car from the background. |
| Visual Odometry | Estimates camera motion from a sequence of images. | Estimating a self-driving car's movement. |
| 3D Object Detection from Video | Detects objects and estimates their 3D positions in video. | Detecting pedestrians in a video from a self-driving car. |
| Action Detection | Identifies specific actions or events in a video stream. | Detecting "running" in a surveillance video. |
| Video Captioning | Generates textual descriptions for video content. | "A person is playing guitar in the park." |
| Video Summarization | Creates a condensed version of a video by highlighting key scenes. | Summarizing a 10-minute soccer match into key highlights. |
| Video Prediction | Predicts future frames in a video sequence. | Anticipating the next frame in a moving car video. |
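
Two small building blocks recur across the video tasks above: choosing a fixed number of frames to feed a model, and comparing a detected action interval with a ground-truth interval. The sketch below illustrates both under simple assumptions: frames are sampled at the centers of equal-length segments, and intervals are `(start, end)` pairs in seconds. The function names are hypothetical and not taken from this project.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices at the centers of equal-length segments."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]


def temporal_iou(a, b):
    """IoU of two time intervals given as (start, end) with start <= end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, `sample_frame_indices(10, 5)` yields `[1, 3, 5, 7, 9]`. When a clip has fewer frames than requested samples, some indices repeat.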

Specialized Tasks

| Specialized Tasks | Description | Examples |
| --- | --- | --- |
| Self-Supervised Learning (SSL) | Learns features from unlabeled data. | Pretraining a model on large video datasets without labels. |
| Zero-Shot Classification | Classifies new categories not seen during training. | Recognizing new objects in images using CLIP. |
| Multi-Modal Image-Text Analysis | Combines image and text for analysis tasks. | Answering questions about image content. |
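
Zero-shot classification with an image-text model such as CLIP generally works by embedding the image and one text prompt per candidate label into a shared space, then picking the label whose text embedding is most similar to the image embedding. The sketch below shows only that final matching step on hand-written toy vectors. In practice the embeddings would come from the model's image and text encoders, which are not called here, and the prompts and numbers are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(image_embedding, label_embeddings[label]),
    )


# Toy embeddings standing in for encoder outputs (assumed values, not real model data).
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]
prediction = zero_shot_classify(image_embedding, label_embeddings)
```

Because no label-specific training is involved, adding a new category only requires adding a new text prompt and its embedding.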

Emerging Research Areas

| Emerging Research Areas | Description | Examples |
| --- | --- | --- |
| Multi-Modal Learning | Combines visual data with other modalities like text or sound. | Combining video and audio for sentiment analysis. |
| Few-Shot Learning | Learns to recognize new classes from few labeled examples. | Training on a new animal species with just a few images. |
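
One common way to do few-shot recognition on top of a frozen foundation model is nearest-prototype classification: average the embeddings of the few labeled examples per class, then assign a query to the closest class mean. The sketch below assumes embeddings are already available as plain lists of floats. The class names, values, and function names are hypothetical, and this is one possible approach rather than the method used in this project.

```python
def class_prototypes(support_set):
    """Average the support embeddings of each class into one prototype vector."""
    prototypes = {}
    for label, embeddings in support_set.items():
        dim = len(embeddings[0])
        prototypes[label] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)
        ]
    return prototypes


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify_few_shot(query, prototypes):
    """Return the label of the prototype closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Two labeled examples per class (toy values standing in for model embeddings).
support_set = {
    "okapi": [[1.0, 0.0], [0.8, 0.2]],
    "quokka": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = class_prototypes(support_set)
prediction = classify_few_shot([0.7, 0.3], prototypes)
```

Here `prediction` is `"okapi"`, since the query lies closer to that class mean than to the other.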

Contributing

Contributions to this project are welcome. You can add new projects, improve existing ones, or fix bugs and errors.

Please follow these steps to contribute:

  • Fork this repository and clone it to your local machine.
  • Create a new branch with a descriptive name for your contribution.
  • Add your code and files to the branch and commit your changes.
  • Push your branch to your forked repository and create a pull request to the main repository.
  • Wait for your pull request to be reviewed and merged.
