We use our Depth Anything pre-trained ViT-L encoder to fine-tune downstream semantic segmentation models.
Note that our results are obtained without Mapillary pre-training.
**Cityscapes** (s.s.: single-scale inference; m.s.: multi-scale inference):

| Method | Encoder | mIoU (s.s.) | mIoU (m.s.) |
|---|---|---|---|
| SegFormer | MiT-B5 | 82.4 | 84.0 |
| Mask2Former | Swin-L | 83.3 | 84.3 |
| OneFormer | Swin-L | 83.0 | 84.4 |
| OneFormer | ConvNeXt-XL | 83.6 | 84.6 |
| DDP | ConvNeXt-L | 83.2 | 83.9 |
| Ours | ViT-L | 84.8 | 86.2 |
**ADE20K**:

| Method | Encoder | mIoU |
|---|---|---|
| SegFormer | MiT-B5 | 51.0 |
| Mask2Former | Swin-L | 56.4 |
| UperNet | BEiT-L | 56.3 |
| ViT-Adapter | BEiT-L | 58.3 |
| OneFormer | Swin-L | 57.4 |
| OneFormer | ConvNeXt-XL | 57.4 |
| Ours | ViT-L | 59.4 |
Please refer to MMSegmentation for installation instructions. Do not forget to install mmdet to support Mask2Former:

```bash
pip install "mmdet>=3.0.0rc4"
```
After installation, set up the files as follows (a shell sketch of these steps is given after the list):
- move our `config/depth_anything` directory into MMSegmentation's `configs`
- move our `dinov2.py` into MMSegmentation's `mmseg/models/backbones`
- register `DINOv2` in MMSegmentation's `mmseg/models/backbones/__init__.py`
- download our provided `torchhub` directory and put it at the root of your working directory
- download the Depth Anything pre-trained model (to initialize the encoder) and put it under the `checkpoints` folder
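The sketch below walks through these steps in order. The `DA` and `MMSEG` variables, the destination layout, and the exact filenames are assumptions about standard checkouts of both repositories; adapt them to your setup.

```bash
# Illustrative paths — point these at your local clones.
DA=path/to/Depth-Anything        # this repository
MMSEG=path/to/mmsegmentation     # your MMSegmentation checkout

# 1) Copy our segmentation configs into MMSegmentation's config tree.
cp -r "$DA/config/depth_anything" "$MMSEG/configs/"

# 2) Copy the DINOv2 backbone definition next to the other backbones.
cp "$DA/dinov2.py" "$MMSEG/mmseg/models/backbones/"

# 3) Register the backbone by editing mmseg/models/backbones/__init__.py
#    (shown as comments, not executed):
#        from .dinov2 import DINOv2
#        __all__ += ['DINOv2']

# 4) Place the torchhub directory at the root of the working directory
#    and create the checkpoints folder for the pre-trained encoder weights.
cp -r "$DA/torchhub" "$MMSEG/"
mkdir -p "$MMSEG/checkpoints"
# Download the Depth Anything pre-trained model into $MMSEG/checkpoints/.
```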
For training or inference with our pre-trained models, please refer to the MMSegmentation instructions; a minimal command sketch is given below.
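As a minimal sketch, the standard MMSegmentation entry points can be used once the files above are in place. The config and checkpoint filenames below are placeholders, not actual filenames from this repository:

```bash
# Train on 8 GPUs with one of the copied Depth Anything configs
# (YOUR_CONFIG.py is a placeholder for a file under configs/depth_anything/).
bash tools/dist_train.sh configs/depth_anything/YOUR_CONFIG.py 8

# Run evaluation with a trained or downloaded checkpoint
# (YOUR_CHECKPOINT.pth is a placeholder).
python tools/test.py configs/depth_anything/YOUR_CONFIG.py checkpoints/YOUR_CHECKPOINT.pth
```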