CVPR, 2024
Xiangtai Li
·
Haobo Yuan
·
Wei Li
·
Henghui Ding
·
Size Wu
·
Wenwei Zhang
·
Yining Li
·
Kai Chen
·
Chen Change Loy
S-Lab, MMlab@NTU, Shanghai AI Laboratory
Xiangtai is the project leader and corresponding author.
In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the Segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation (as in SAM), and video object segmentation. To our knowledge, this is the first model to handle all of these tasks in a single model and achieve good enough performance.
We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Both the code and models will be publicly available.
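For a concrete picture, below is a minimal, hypothetical PyTorch sketch of the shared encoder-decoder idea: one transformer decoder consumes both learnable object queries and prompt-derived queries, and emits mask and class embeddings. All module names, sizes, and the prompt handling are illustrative assumptions, not the actual OMG-Seg implementation.

```python
import torch
import torch.nn as nn

class SharedSegDecoder(nn.Module):
    """Illustrative sketch of a shared transformer decoder in the spirit of OMG-Seg.
    It decodes learnable object queries (for image/video segmentation) and optional
    prompt queries (for SAM-like interactive segmentation) against pixel features."""

    def __init__(self, embed_dim=256, num_object_queries=300, num_layers=6):
        super().__init__()
        self.object_queries = nn.Embedding(num_object_queries, embed_dim)
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(embed_dim, embed_dim)  # dot product with pixel features -> masks
        self.cls_head = nn.Linear(embed_dim, embed_dim)   # compared with CLIP text embeddings -> labels

    def forward(self, pixel_features, prompt_queries=None):
        # pixel_features: (B, HW, C) flattened features, e.g. from a frozen CLIP backbone
        batch = pixel_features.size(0)
        queries = self.object_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        if prompt_queries is not None:
            # interactive mode: append visual prompt queries of shape (B, P, C)
            queries = torch.cat([queries, prompt_queries], dim=1)
        decoded = self.decoder(queries, pixel_features)
        return self.mask_head(decoded), self.cls_head(decoded)
```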
A short introduction to OMG-Seg and related work, presented at VALSE, can be found here (in Chinese).
See DATASET.md for dataset preparation.
Our codebase is built on MMDetection 3.0 tools.
See INSTALL.md for installation instructions.
- First, set up the dataset and environment. Make sure you have installed the fixed, corresponding versions.
- Download the pre-trained CLIP backbone. The scripts will automatically download the pre-trained CLIP models.
- Generate CLIP text embeddings for each dataset and for the jointly merged dataset used in co-training. See the embedding generation (a minimal sketch follows this list).
- Run the train/test scripts below to carry out model training and testing experiments.
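For reference, here is a minimal, hypothetical sketch of generating CLIP text embeddings for a class vocabulary with the open_clip package; the model name, pretrained tag, prompt template, and output path are assumptions, and the actual embedding-generation script in this repository may differ.

```python
# Hypothetical sketch only; the model name, pretrained tag, prompt template,
# and output path are assumptions and may differ from the repo's script.
import torch
import open_clip

model_name = 'convnext_large_d_320'
model, _, _ = open_clip.create_model_and_transforms(
    model_name, pretrained='laion2b_s29b_b131k_ft_soup')
tokenizer = open_clip.get_tokenizer(model_name)

class_names = ['person', 'car', 'dog']  # replace with the merged dataset vocabulary
with torch.no_grad():
    tokens = tokenizer([f'a photo of a {name}' for name in class_names])
    text_embeddings = model.encode_text(tokens)
    text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)

torch.save(text_embeddings, 'text_embeddings.pth')  # path is an assumption
```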
See the configs under seg/configs/m2ov_train.
./tools/dist.sh train seg/configs/m2ov_train/omg_convl_vlm_fix_12e_ov_coco_vid_yt19_vip_city_cocopansam.py 8 --checkpoint pre_trained_model_path
Note that you can also train from the CLIP pre-trained models alone, without loading an extra checkpoint, by running the following command.
./tools/dist.sh train seg/configs/m2ov_train/omg_convl_vlm_fix_12e_ov_coco_vid_yt19_vip_city_cocopansam.py 8
We use Slurm to train our model on 32 A100 GPUs.
PARTITION=YOUR_PARTITION JOB_NAME=YOUR_JOB_NAME GPUS=32 GPUS_PER_NODE=8 ./tools/slurm.sh train seg/configs/m2ov_train/omg_convl_vlm_fix_12e_ov_coco_vid_yt19_vip_city_cocopansam.py
Run the visualization script on COCO:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_coco.py 1 --checkpoint model_path --show-dir vis
Run the visualization script on VIPSeg:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_vipseg.py 1 --checkpoint model_path --show-dir vis
The color maps are dumped into the vis sub-folder under work_dir.
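To quickly inspect the dumped visualizations, here is a small sketch (assuming the color maps are saved as PNG files under a work_dirs root; adjust the path to your actual work_dir):

```python
# Minimal sketch: open a few dumped color maps for manual inspection.
# The 'work_dirs' root and PNG format are assumptions; adjust to your setup.
from pathlib import Path
from PIL import Image

vis_images = sorted(p for p in Path('work_dirs').rglob('*.png') if 'vis' in p.parts)
for image_path in vis_images[:5]:
    Image.open(image_path).show()
```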
See the configs under seg/configs/m2ov_val. Make sure you have set up the classification embeddings for testing.
Test the Cityscapes dataset. We observe about 0.3% noise for Cityscapes panoptic segmentation:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_cityscapes.py 4 --checkpoint model_path
Test the COCO dataset. We observe about 0.5% noise for COCO panoptic segmentation:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_coco.py 4 --checkpoint model_path
Test the open-vocabulary ADE dataset. We observe about 0.8% noise for ADE panoptic segmentation:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_ade.py 4 --checkpoint model_path
Test Interactive COCO segmentation:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_ov_coco_pan_point.py 4 --checkpoint model_path
Test the YouTube-VIS 2019 dataset:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_y19.py 4 --checkpoint model_path
Test the VIPSeg dataset:
./tools/dist.sh test seg/configs/m2ov_val/eval_m2_convl_300q_ov_vipseg.py 1 --checkpoint model_path
ConvNeXt-Large backbone: model
ConvNeXt-XXLarge backbone: model
The Objects365 pre-trained models can be found here.
Re-running our codebase on a single machine with the ConvNeXt-Large backbone: model, log.
If you find the OMG-Seg codebase and models useful for your research, please consider citing us:
@inproceedings{OMGSeg,
author = {Xiangtai Li and
Haobo Yuan and
Wei Li and
Henghui Ding and
Size Wu and
Wenwei Zhang and
Yining Li and
Kai Chen and
Chen Change Loy},
title = {OMG-Seg: Is One Model Good Enough For All Segmentation?},
booktitle = {CVPR},
year = {2024}
}
This project is released under the S-Lab LICENSE.