Add Mask2Former to SMP #1044

Open
caxel-ap opened this issue Jan 24, 2025 · 4 comments · May be fixed by #1059
@caxel-ap

The Mask2Former model was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository.

Mask2Former addresses instance, semantic, and panoptic segmentation with the same paradigm: predicting a set of masks and corresponding labels. Hence, all three tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, in both performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation, and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.
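For intuition, here is a minimal PyTorch sketch (not the paper's implementation; shapes and values are illustrative) of how a set of predicted masks plus per-query class logits can be combined into a semantic segmentation map:

```python
import torch

# Hypothetical decoder outputs: N query masks and N class distributions.
num_queries, num_classes, H, W = 100, 150, 128, 128
class_logits = torch.randn(num_queries, num_classes + 1)  # extra slot = "no object"
mask_logits = torch.randn(num_queries, H, W)              # one mask logit map per query

# Weight each mask by its class probabilities (dropping "no object"),
# sum over queries, and take the per-pixel argmax to get a semantic map.
class_probs = class_logits.softmax(dim=-1)[:, :-1]  # (N, C)
mask_probs = mask_logits.sigmoid()                   # (N, H, W)
semantic_map = torch.einsum("nc,nhw->chw", class_probs, mask_probs).argmax(dim=0)  # (H, W)
```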

Papers with Code
https://paperswithcode.com/paper/masked-attention-mask-transformer-for

Paper:
https://arxiv.org/abs/2112.01527

HF Reference implementation:
https://huggingface.co/docs/transformers/main/en/model_doc/mask2former
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mask2former/modeling_mask2former.py

@qubvel
Collaborator

qubvel commented Jan 25, 2025

Thanks for opening an issue @caxel-ap 🤗 It might be the first instance segmentation model in the library. Let's see if anyone is eager to contribute; I suppose it will be super impactful 👍

@caxel-ap
Author

Even just semantic segmentation would be great to have in here someday. I've had good results using it in transformers for semantic segmentation with https://huggingface.co/facebook/mask2former-swin-large-ade-semantic
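For reference, this is roughly how that checkpoint can be used through transformers (a sketch based on the docs linked in the issue description, not an SMP API; the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

checkpoint = "facebook/mask2former-swin-large-ade-semantic"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Post-process into an (H, W) tensor of ADE20K class ids at the original resolution.
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
```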

@ariG23498

Hey @qubvel !

I would like to work on it. Is there a guideline on how to contribute to this repo? My todos would be:

  1. Read the resources shared
  2. Read the contribution guidelines, if available
  3. Start a draft PR and iterate on it.

Thanks!

@qubvel
Collaborator

qubvel commented Feb 7, 2025

Hey @ariG23498! That's super cool, thanks for your interest 🤗

At the moment there are no guidelines, but you can get inspiration from any of the existing models. The code for existing models is relatively small, so you can just copy decoders/unet and start from that point.

  • There is no need to implement an encoder; as far as I understand, Swin should be compatible with timm models.
  • As suggested above, we can start with a semantic segmentation decoder and see if we can extend it to instance/panoptic as well.
  • The main idea is to have decoder.py and model.py files under decoders/mask2former.

Just let me know what questions you run into and I will try to answer them, and then add it to the docs 🤗
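To make the layout above concrete, here is a rough skeleton of what decoders/mask2former/model.py could look like, mirroring decoders/unet. The Mask2FormerDecoder class, its arguments, the tu-swin encoder name, and the head wiring are assumptions for illustration only; the pixel decoder and masked-attention Transformer decoder internals are omitted.

```python
# decoders/mask2former/model.py -- rough skeleton, following the layout of decoders/unet.
from typing import Optional

from segmentation_models_pytorch.base import SegmentationHead, SegmentationModel
from segmentation_models_pytorch.encoders import get_encoder

from .decoder import Mask2FormerDecoder  # to be implemented: pixel decoder + masked-attention Transformer decoder


class Mask2Former(SegmentationModel):
    def __init__(
        self,
        encoder_name: str = "tu-swin_base_patch4_window12_384",  # a timm Swin via the tu- prefix (assumption)
        encoder_depth: int = 5,
        encoder_weights: Optional[str] = "imagenet",
        in_channels: int = 3,
        classes: int = 1,
        num_queries: int = 100,
    ):
        super().__init__()

        # Reuse an existing encoder; no new backbone code needed.
        self.encoder = get_encoder(
            encoder_name,
            in_channels=in_channels,
            depth=encoder_depth,
            weights=encoder_weights,
        )

        # Placeholder decoder: would produce per-query masks and class logits.
        self.decoder = Mask2FormerDecoder(
            encoder_channels=self.encoder.out_channels,
            num_queries=num_queries,
            num_classes=classes,
        )

        # Placeholder head: for plain semantic segmentation the mask-classification
        # output could be reduced to dense per-pixel logits before this point.
        self.segmentation_head = SegmentationHead(
            in_channels=classes, out_channels=classes, kernel_size=1
        )
        self.classification_head = None

        self.name = "mask2former-{}".format(encoder_name)
        self.initialize()
```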

@ariG23498 linked a pull request on Feb 11, 2025 that will close this issue.