[WIP][Add] Mask2Former #1059

ariG23498 · 2025-02-11T04:58:50Z

ariG23498 · 2025-02-13T06:26:51Z

Hi @qubvel, some questions before I start contributing to the core part of the model:

1. This model seems to have three parts: a pixel encoder, a pixel decoder, and a transformer decoder. As I understand it, I do not have to write the pixel encoder, since it can be taken directly from timm, so the only parts I would need to contribute right now are the two decoders.
2. Do we want this to be inference-only at first? If so, my workflow would be to copy the code from the transformers implementation, convert the weights (only if required), and pass an image through the model to check that the segmentation maps come out correctly.
3. I am unsure about the processor. Do you think we should concentrate on that as well?
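
A minimal sketch of the final step of that sanity check, assuming the two decoders return per-query class logits (with a trailing "no object" class) and per-query mask logits. The function name `semantic_map` and the toy shapes are illustrative only and are not taken from any existing implementation:

```python
import numpy as np


def semantic_map(class_logits: np.ndarray, mask_logits: np.ndarray) -> np.ndarray:
    """Combine per-query outputs into a per-pixel class map.

    class_logits: (Q, C + 1), last column is the "no object" class.
    mask_logits:  (Q, H, W).
    Returns an (H, W) integer array of class ids.
    """
    # Softmax over classes, then drop the trailing "no object" column.
    shifted = class_logits - class_logits.max(axis=-1, keepdims=True)
    class_probs = np.exp(shifted)
    class_probs /= class_probs.sum(axis=-1, keepdims=True)
    class_probs = class_probs[:, :-1]  # (Q, C)

    # Sigmoid turns each query's mask logits into per-pixel probabilities.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # (Q, H, W)

    # Weight every query mask by its class probabilities, sum over queries.
    scores = np.einsum("qc,qhw->chw", class_probs, mask_probs)  # (C, H, W)
    return scores.argmax(axis=0)


# Toy input: 2 queries, 2 classes (+ "no object"), 2x2 image.
class_logits = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
mask_logits = np.array(
    [
        [[4.0, -4.0], [4.0, -4.0]],  # query 0 covers the left column
        [[-4.0, 4.0], [-4.0, 4.0]],  # query 1 covers the right column
    ]
)
seg = semantic_map(class_logits, mask_logits)
print(seg)  # [[0 1]
            #  [0 1]]
```

Comparing a map like this against the output of the reference implementation on the same image would be one way to confirm that the weight conversion is correct.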

Let me know what you think.

qubvel · 2025-02-13T11:48:09Z

Hey @ariG23498, thanks for the questions. I tried to answer them, but let me know if anything is unclear.

1. Yes.
2. AFAIU the semantic segmentation model should be trainable with the existing tutorials; otherwise, if there are any nuances, we can make a new tutorial.
3. We can use Albumentations for preprocessing, similar to what I used for SegFormer, and create a notebook showing how to run inference. See also the snippet for SegFormer: https://huggingface.co/smp-hub/segformer-b3-512x512-ade-160k

init

9342471
