The official implementation of the paper "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition". Our code is based on timm and ConvNeXt.
Our paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
- torch>=1.12.1
- torchvision>=0.13.1
- timm>=0.9.12
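If you are setting up a fresh environment, the dependencies above can be installed with pip, for example (choose the torch build that matches your CUDA version):

```bash
# Illustrative install command; adjust versions and the torch/CUDA build as needed
pip install "torch>=1.12.1" "torchvision>=0.13.1" "timm>=0.9.12"
```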
A Conv2Former model can be created as follows:

```python
from models.conv2former import conv2former_n

# create a Conv2Former model with 1000 classes
model = conv2former_n(num_classes=1000)
```
Input images should be normalized with the standard ImageNet statistics:

```python
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
```
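Putting the two snippets together, a minimal inference sketch looks as follows. The resize/center-crop values and the image path are illustrative, and the model uses randomly initialized weights unless you load a checkpoint first.

```python
import torch
from PIL import Image
from torchvision import transforms

from models.conv2former import conv2former_n

# Standard ImageNet preprocessing for a 224x224 input (adjust for other resolutions)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = conv2former_n(num_classes=1000)
model.eval()

# "cat.jpg" is a placeholder image path
img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(img)           # expected shape: (1, 1000)
    pred = logits.argmax(dim=1)   # predicted ImageNet class index
print(pred.item())
```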
We also provide the Jittor version of the model:
```python
import jittor as jt
from jittor_models.conv2former import conv2former_n

# create a Conv2Former model with 1000 classes
model = conv2former_n(num_classes=1000)
```
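A quick forward-pass check for the Jittor model (randomly initialized weights, dummy input; enable CUDA only if it is available):

```python
import jittor as jt
from jittor_models.conv2former import conv2former_n

jt.flags.use_cuda = 0  # set to 1 to run on GPU

model = conv2former_n(num_classes=1000)
model.eval()

x = jt.randn(1, 3, 224, 224)  # dummy 224x224 input batch
logits = model(x)             # expected shape: (1, 1000)
print(logits.shape)
```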
To train a model (the first argument, 8, is the number of GPUs/processes), run:

```bash
bash distributed_train.sh 8 $DATA_DIR --model $MODEL -b $BS --lr $LR --drop-path $DROP_PATH
# DATA_DIR: path to the dataset
# MODEL: name of the model
# BS: batch size
# LR: learning rate
# DROP_PATH: drop path rate
```
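For example, a hypothetical run might look like the following; the batch size, learning rate, and drop-path rate here are placeholders, not the training recipe from the paper.

```bash
# Illustrative values only; consult the paper for the exact hyper-parameters
bash distributed_train.sh 8 /path/to/imagenet --model conv2former_n -b 128 --lr 1e-3 --drop-path 0.1
```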
To evaluate a trained model, run:

```bash
python validate.py $DATA_DIR --model $MODEL --checkpoint $CHECKPOINT
# DATA_DIR: path to the dataset
# MODEL: name of the model
# CHECKPOINT: path to the saved checkpoint
```
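For example, with a downloaded checkpoint (the file name is a placeholder):

```bash
python validate.py /path/to/imagenet --model conv2former_n --checkpoint conv2former_n.pth
```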
Results of models trained on ImageNet-1K:

Model | Parameters | FLOPs | Image resolution | Top-1 Acc. | Model File
---|---|---|---|---|---
Conv2Former-N | 15M | 2.2G | 224 | 81.5% | baiduyun
SwinT-T | 28M | 4.5G | 224 | 81.5% | -
ConvNeXt-T | 29M | 4.5G | 224 | 82.1% | -
Conv2Former-T | 27M | 4.4G | 224 | 83.2% | baiduyun
SwinT-S | 50M | 8.7G | 224 | 83.0% | -
ConvNeXt-S | 50M | 8.7G | 224 | 83.1% | -
Conv2Former-S | 50M | 8.7G | 224 | 84.1% | baiduyun
RepLKNet-31B | 79M | 15.3G | 224 | 83.5% | -
SwinT-B | 88M | 15.4G | 224 | 83.5% | -
ConvNeXt-B | 89M | 15.4G | 224 | 83.8% | -
FocalNet-B | 89M | 15.4G | 224 | 83.9% | -
Conv2Former-B | 90M | 15.9G | 224 | 84.4% | baiduyun
Results of models pre-trained on ImageNet-22K and fine-tuned on ImageNet-1K:

Model | Parameters | FLOPs | Image resolution | Top-1 Acc. | Model File
---|---|---|---|---|---
ConvNeXt-S | 50M | 8.7G | 224 | 84.6% | -
Conv2Former-S | 50M | 8.7G | 224 | 84.9% | baiduyun
SwinT-B | 88M | 15.4G | 224 | 85.2% | -
ConvNeXt-B | 89M | 15.4G | 224 | 85.8% | -
Conv2Former-B | 90M | 15.9G | 224 | 86.2% | baiduyun
SwinT-B | 88M | 47.0G | 384 | 86.4% | -
ConvNeXt-B | 89M | 45.1G | 384 | 86.8% | -
Conv2Former-B | 90M | 46.7G | 384 | 87.0% | -
SwinT-L | 197M | 34.5G | 224 | 86.3% | -
ConvNeXt-L | 198M | 34.4G | 224 | 86.6% | -
Conv2Former-L | 199M | 36.0G | 224 | 87.0% | baiduyun
EffNet-V2-XL | 208M | 94G | 480 | 87.3% | -
SwinT-L | 197M | 104G | 384 | 87.3% | -
ConvNeXt-L | 198M | 101G | 384 | 87.5% | -
CoAtNet-3 | 168M | 107G | 384 | 87.6% | -
Conv2Former-L | 199M | 106G | 384 | 87.7% | baiduyun
If you find this work or code helpful in your research, please cite:
```bibtex
@article{hou2024conv2former,
  title={Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition},
  author={Hou, Qibin and Lu, Cheng-Ze and Cheng, Ming-Ming and Feng, Jiashi},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  doi={10.1109/TPAMI.2024.3401450},
}
```
You may also want to cite the compared methods:
```bibtex
@inproceedings{liu2022convnet,
  title={A ConvNet for the 2020s},
  author={Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{liu2021swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  booktitle={ICCV},
  year={2021}
}

@inproceedings{tan2021efficientnetv2,
  title={EfficientNetV2: Smaller Models and Faster Training},
  author={Tan, Mingxing and Le, Quoc},
  booktitle={ICML},
  pages={10096--10106},
  year={2021},
  organization={PMLR}
}

@misc{focalnet,
  author={Yang, Jianwei and Li, Chunyuan and Gao, Jianfeng},
  title={Focal Modulation Networks},
  publisher={arXiv},
  year={2022},
}

@article{dai2021coatnet,
  title={CoAtNet: Marrying Convolution and Attention for All Data Sizes},
  author={Dai, Zihang and Liu, Hanxiao and Le, Quoc and Tan, Mingxing},
  journal={NeurIPS},
  volume={34},
  year={2021}
}

@inproceedings{replknet,
  author={Ding, Xiaohan and Zhang, Xiangyu and Zhou, Yizhuang and Han, Jungong and Ding, Guiguang and Sun, Jian},
  title={Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs},
  booktitle={CVPR},
  year={2022},
}
```