This project benchmarks state-of-the-art (SOTA) vision foundation models on a variety of visual recognition tasks, including image classification, object detection, and semantic segmentation.
Core Visual Recognition Tasks | Description | Examples |
---|---|---|
Image Classification | Assigns a label to an image. | Classifying an image as "cat" or "dog." |
Object Detection | Detects and localizes objects in an image. | Detecting cars and pedestrians in a street scene. |
Semantic Segmentation | Classifies each pixel into a category. | Separating road, sky, and pedestrians in an image. |
Instance Segmentation | Identifies individual instances of objects and their masks. | Labeling each pedestrian in a crowd separately. |
Image Captioning | Generates a textual description of an image. | "A dog playing in the park." |
Action Recognition | Identifies actions in an image or video. | Recognizing someone is "running" or "jumping." |
Pose Estimation | Estimates joint locations of humans or animals. | Locating body joints of a person holding a yoga pose. |
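
Benchmarking the core tasks above usually comes down to a small set of scoring functions. The following is a minimal sketch, not this project's actual evaluation code: `top1_accuracy` and `box_iou` are hypothetical helper names, and the `(x_min, y_min, x_max, y_max)` box layout is an assumption. It shows top-1 accuracy for image classification and intersection over union (IoU) for comparing a detected box with a ground-truth box.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def top1_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of images whose predicted label equals the ground-truth label."""
    if len(predicted) != len(expected) or len(expected) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection is commonly counted as correct when its IoU with a ground-truth box exceeds a chosen threshold such as 0.5; the threshold is a benchmark design choice rather than something fixed by this project.
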
Advanced Visual Recognition Tasks | Description | Examples |
---|---|---|
Image Segmentation (General) | Divides an image into meaningful regions with pixel-level accuracy. | Separating a cat from the background. |
Depth Estimation | Predicts depth for each pixel in an image. | Estimating distances in a 3D scene. |
3D Reconstruction from Images | Reconstructs a 3D model from multiple images. | Building a 3D model of a building from photos. |
OCR (Optical Character Recognition) | Recognizes and extracts text from images. | Reading a street sign in a photograph. |
Image Super-Resolution | Enhances the resolution of an image. | Upscaling a low-resolution image to higher resolution. |
Image Inpainting | Fills in missing or corrupted parts of an image. | Restoring damaged areas in an old photograph. |
Image Style Transfer | Transfers the style of one image to another. | Applying Van Gogh’s painting style to a photo. |
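
Restoration tasks in the table above, such as super-resolution and inpainting, are often scored by comparing the output with a reference image. The sketch below computes peak signal-to-noise ratio (PSNR) over flattened pixel values. It is only an illustration: the function name `psnr` and the 8-bit default `max_value=255.0` are assumptions, not part of this project.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixel values.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the restored image is closer to the reference. PSNR does not always track perceived quality, so benchmarks often report it alongside other metrics.
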
Video-Based Visual Recognition Tasks | Description | Examples |
---|---|---|
Video Classification | Classifies video sequences based on content. | Identifying a video as "sports" or "news." |
Object Tracking | Continuously tracks objects across frames. | Following a car in a traffic video. |
Video Action Recognition | Recognizes actions in a video sequence. | Identifying a soccer player "kicking a ball." |
Video Segmentation | Performs segmentation across video frames. | Segmenting a moving car from the background. |
Visual Odometry | Estimates camera motion from a sequence of images. | Estimating a self-driving car's movement. |
3D Object Detection from Video | Detects objects and estimates their 3D positions in video. | Detecting pedestrians in a video from a self-driving car. |
Action Detection | Identifies specific actions or events in a video stream. | Detecting "running" in a surveillance video. |
Video Captioning | Generates textual descriptions for video content. | "A person is playing guitar in the park." |
Video Summarization | Creates a condensed version of a video by highlighting key scenes. | Summarizing a 10-minute soccer match into key highlights. |
Video Prediction | Predicts future frames in a video sequence. | Anticipating the next frame in a moving car video. |
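
Most of the video tasks above start by choosing which frames to feed to the model. One common preprocessing choice, assumed here for illustration and not prescribed by this project, is to split the clip into equal segments and take the middle frame of each. The helper name `sample_frame_indices` is hypothetical.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    The clip is divided into num_samples equal segments and the frame at
    the centre of each segment is selected. Indices may repeat when the
    clip has fewer frames than requested samples.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int((i + 0.5) * segment))
            for i in range(num_samples)]


# Example: a 10-frame clip sampled at 5 positions.
print(sample_frame_indices(10, 5))  # prints [1, 3, 5, 7, 9]
```
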
Specialized Tasks | Description | Examples |
---|---|---|
Self-Supervised Learning (SSL) | Learns features from unlabeled data. | Pretraining a model on large video datasets without labels. |
Zero-Shot Classification | Classifies new categories not seen during training. | Recognizing new objects in images using CLIP. |
Multi-Modal Image-Text Analysis | Combines image and text for analysis tasks. | Answering questions about image content. |
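
Zero-shot classification with a CLIP-style model can be sketched as a nearest-neighbour search in a shared embedding space: the image embedding is compared with one text embedding per candidate label, and the most similar label wins. The example below is a toy illustration under stated assumptions. The three-dimensional vectors are made up, and `cosine_similarity` and `zero_shot_classify` are hypothetical helpers; a real setup would obtain the embeddings from the model's image and text encoders.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding: Sequence[float],
                       label_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding is most similar to the image."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(image_embedding,
                                            label_embeddings[label]),
    )


# Made-up embeddings standing in for encoder outputs (assumption).
label_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]
print(zero_shot_classify(image_embedding, label_embeddings))  # prints "dog"
```

Because the label set is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes the approach zero-shot.
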
Emerging Research Areas | Description | Examples |
---|---|---|
Multi-Modal Learning | Combines visual data with other modalities like text or sound. | Combining video and audio for sentiment analysis. |
Few-Shot Learning | Learns to recognize new classes from few labeled examples. | Training on a new animal species with just a few images. |
Contributions are welcome: you can add new projects, improve existing ones, or fix bugs and errors.
Please follow these steps to contribute:
- Fork this repository and clone it to your local machine.
- Create a new branch with a descriptive name for your contribution.
- Add your code and files to the branch and commit your changes.
- Push your branch to your forked repository and create a pull request to the main repository.
- Wait for your pull request to be reviewed and merged.
SOTA Vision Foundation Models Benchmarking Resources:
- Vision Models Benchmarking for Visual Recognition Tasks
- Video Models Benchmarking for Visual Recognition Tasks
Built-In Tools:
- M6 - Vision AI tool
- Vision AI: Image & Visual AI Tools - Google Cloud API
- Create a custom Image Analysis model (preview) - Azure AI Vision
Vision Foundation Models resources:
- Recent Advances in Vision Foundation Models, CVPR 2023 tutorial: https://cvpr.thecvf.com/virtual/2023/tutorial/18558
- Foundation Models for Vision, HF collection by @merve: https://huggingface.co/collections/merve/foundation-models-for-vision-6516d5c6af977f435be43ace
- Roboflow: https://roboflow.com/model-feature/foundation-vision
- The Tenyks Blogger: https://medium.com/@tenyks_blogger/the-foundation-models-reshaping-computer-vision-b299a91527fb