TF Vision Model Garden

Introduction

TF Vision model garden provides a large collection of baselines and checkpoints for image classification, object detection, and instance segmentation.

Image Classification

ImageNet Baselines

ResNet models trained with vanilla settings:

  • Models are trained from scratch with a batch size of 4096 and an initial learning rate of 1.6.
  • Linear warmup is applied for the first 5 epochs.
  • Models are trained with L2 weight regularization and ReLU activation.

| model | resolution | epochs | Top-1 | Top-5 | download |
| --- | --- | --- | --- | --- | --- |
| ResNet-50 | 224x224 | 90 | 76.1 | 92.9 | config |
| ResNet-50 | 224x224 | 200 | 77.1 | 93.5 | config |
| ResNet-101 | 224x224 | 200 | 78.3 | 94.2 | config |
| ResNet-152 | 224x224 | 200 | 78.7 | 94.3 | config |
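
The warmup setting above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code behind these checkpoints. It treats the quoted 1.6 as the rate reached at the end of warmup, and it assumes the ImageNet-1k training split has 1,281,167 images; neither detail is stated in this document. The names `warmup_lr` and `lr_per_256` are hypothetical.

```python
# Settings quoted from the list above.
BATCH_SIZE = 4096
PEAK_LR = 1.6  # assumption: the quoted rate is the value reached after warmup
WARMUP_EPOCHS = 5

# Assumption: size of the ImageNet-1k training split (not stated in this document).
NUM_TRAIN_IMAGES = 1_281_167
STEPS_PER_EPOCH = NUM_TRAIN_IMAGES // BATCH_SIZE


def lr_per_256(peak_lr: float = PEAK_LR, batch_size: int = BATCH_SIZE) -> float:
    """Express the learning rate per 256 examples (linear scaling rule)."""
    return peak_lr * 256 / batch_size


def warmup_lr(step: int) -> float:
    """Learning rate at a global step, covering the warmup phase only.

    The rate ramps linearly from 0 to PEAK_LR over the warmup steps and is
    then held at PEAK_LR. The decay that follows warmup is not described
    above, so it is deliberately left out of this sketch.
    """
    warmup_steps = WARMUP_EPOCHS * STEPS_PER_EPOCH
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR
```

Under these assumptions, 1.6 at batch size 4096 corresponds to 0.1 per 256 examples, which is a convenient way to compare the setting against recipes that use a smaller batch.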

ResNet-RS models:

We support state-of-the-art ResNet-RS image classification models with the following features:

  • ResNet-RS architectural changes and Swish activation. (Note that ResNet-RS adopts ReLU activation in the paper.)
  • Regularization methods including RandAugment, 4e-5 weight decay, stochastic depth, label smoothing, and dropout.
  • New training methods including a 350-epoch schedule, cosine learning rate and EMA.
  • Configs are in this directory.

| model | resolution | params (M) | Top-1 | Top-5 | download |
| --- | --- | --- | --- | --- | --- |
| ResNet-RS-50 | 160x160 | 35.7 | 79.1 | 94.5 | config |
| ResNet-RS-101 | 160x160 | 63.7 | 80.2 | 94.9 | config |
| ResNet-RS-101 | 192x192 | 63.7 | 81.3 | 95.6 | config |
| ResNet-RS-152 | 192x192 | 86.8 | 81.9 | 95.8 | config |
| ResNet-RS-152 | 224x224 | 86.8 | 82.5 | 96.1 | config |
| ResNet-RS-152 | 256x256 | 86.8 | 83.1 | 96.3 | config |
| ResNet-RS-200 | 256x256 | 93.4 | 83.5 | 96.6 | config |
| ResNet-RS-270 | 256x256 | 130.1 | 83.6 | 96.6 | config |
| ResNet-RS-350 | 256x256 | 164.3 | 83.7 | 96.7 | config |
| ResNet-RS-350 | 320x320 | 164.3 | 84.2 | 96.9 | config |
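
Three of the ingredients listed above (cosine learning rate, EMA of the weights, and label smoothing) are simple enough to write out directly. The sketch below is illustrative only: the function names are hypothetical, and the default `base_lr`, `decay`, and `smoothing` values are placeholders chosen for the example, not values taken from the ResNet-RS configs.

```python
import math
from typing import List


def cosine_lr(epoch: float, total_epochs: int = 350, base_lr: float = 1.0) -> float:
    """Cosine decay from base_lr at epoch 0 down to 0 at total_epochs."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def ema_update(shadow: List[float], weights: List[float], decay: float = 0.9999) -> List[float]:
    """One EMA step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


def smooth_labels(one_hot: List[float], smoothing: float = 0.1) -> List[float]:
    """Mix a one-hot target with the uniform distribution over classes."""
    num_classes = len(one_hot)
    return [(1.0 - smoothing) * y + smoothing / num_classes for y in one_hot]
```

In a training loop, `ema_update` would typically run once per optimizer step, with the shadow weights (rather than the raw weights) used for evaluation.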

Object Detection and Instance Segmentation

Common Settings and Notes

  • We provide models based on two detection frameworks, RetinaNet and Mask R-CNN, and two backbones, ResNet-FPN and SpineNet.
  • All models are trained on COCO train2017 and evaluated on COCO val2017.
  • Training details:
    • Models finetuned from ImageNet pretrained checkpoints adopt the 12- or 36-epoch schedule. Models trained from scratch adopt the 350-epoch schedule.
    • The default training data augmentation applies horizontal flipping and scale jittering with a random scale in [0.5, 2.0].
    • Unless noted, all models are trained with L2 weight regularization and ReLU activation.
    • We use a batch size of 256 and a stepwise learning rate that decays at the last 30 and the last 10 epochs.
    • We use square images as input: the long side of each image is resized to the target size, and the short side is padded with zeros.
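
The input preparation and learning rate schedule described above can be sketched as follows. This is a rough illustration in plain Python under stated assumptions, not the library's preprocessing code. The helper names are hypothetical, the decay factor of 0.1 is an assumption (the list above only says when the rate decays, not by how much), and the sketch only computes sizes and padding amounts without deciding which edge the padding goes on.

```python
import random


def square_resize_plan(height: int, width: int, target: int):
    """Return (new_h, new_w, pad_h, pad_w) for a target x target input.

    The long side is scaled to `target` with the aspect ratio preserved,
    and the short side is zero-padded up to `target`.
    """
    scale = target / max(height, width)
    new_h = min(target, round(height * scale))
    new_w = min(target, round(width * scale))
    return new_h, new_w, target - new_h, target - new_w


def sample_jitter_scale(rng: random.Random, low: float = 0.5, high: float = 2.0) -> float:
    """Draw a random scale-jittering factor from [low, high]."""
    return rng.uniform(low, high)


def stepwise_lr(epoch: int, total_epochs: int, base_lr: float, factor: float = 0.1) -> float:
    """Stepwise schedule that decays at the last 30 and the last 10 epochs.

    `factor` is an assumed decay multiplier applied at each boundary.
    """
    lr = base_lr
    if epoch >= total_epochs - 30:
        lr *= factor
    if epoch >= total_epochs - 10:
        lr *= factor
    return lr
```

For example, a 480x640 image with a 640 target keeps its size and gains 160 rows of zero padding, while with a 350-epoch schedule the decay boundaries in this sketch fall at epochs 320 and 340.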

COCO Object Detection Baselines

RetinaNet (ImageNet pretrained)

| backbone | resolution | epochs | FLOPs (B) | params (M) | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| R50-FPN | 640x640 | 12 | 97.0 | 34.0 | 34.3 | config |
| R50-FPN | 640x640 | 36 | 97.0 | 34.0 | 37.3 | config |

RetinaNet (Trained from scratch) with training features including:

  • Stochastic depth with drop rate 0.2.
  • Swish activation.

| backbone | resolution | epochs | FLOPs (B) | params (M) | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| SpineNet-49 | 640x640 | 500 | 85.4 | 28.5 | 44.2 | config, TB.dev |
| SpineNet-96 | 1024x1024 | 500 | 265.4 | 43.0 | 48.5 | config, TB.dev |
| SpineNet-143 | 1280x1280 | 500 | 524.0 | 67.0 | 50.0 | config, TB.dev |
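
For readers unfamiliar with the two training features named above, here is a minimal scalar sketch of each. It is not the SpineNet implementation: the function names are hypothetical, the tensors are plain Python lists, and how the 0.2 drop rate is distributed across blocks is not stated here, so the sketch simply applies one given rate to one residual block.

```python
import math
import random
from typing import List


def swish(x: float) -> float:
    """Swish activation, x * sigmoid(x), written to avoid overflow in exp."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def stochastic_depth(
    shortcut: List[float],
    residual: List[float],
    drop_rate: float,
    training: bool,
    rng: random.Random,
) -> List[float]:
    """Add a residual branch to its shortcut under stochastic depth.

    During training the residual branch is dropped with probability
    `drop_rate`. When kept, it is scaled by 1 / (1 - drop_rate) so that its
    expected contribution matches inference, where it is always kept.
    """
    if not training or drop_rate == 0.0:
        return [s + r for s, r in zip(shortcut, residual)]
    if rng.random() < drop_rate:
        return list(shortcut)
    keep_prob = 1.0 - drop_rate
    return [s + r / keep_prob for s, r in zip(shortcut, residual)]
```

With `drop_rate=0.2`, a block's residual branch is skipped in roughly one training step out of five and scaled by 1.25 otherwise.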

Mobile-size RetinaNet (Trained from scratch):

| backbone | resolution | epochs | FLOPs (B) | params (M) | box AP | download |
| --- | --- | --- | --- | --- | --- | --- |
| Mobile SpineNet-49 | 384x384 | 600 | 1.0 | 2.32 | 28.1 | config |

Instance Segmentation Baselines

Mask R-CNN (ImageNet pretrained)

Mask R-CNN (Trained from scratch)

| backbone | resolution | epochs | FLOPs (B) | params (M) | box AP | mask AP | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SpineNet-49 | 640x640 | 350 | 215.7 | 40.8 | 42.6 | 37.9 | config |

Video Classification

Common Settings and Notes

  • We provide models for video classification with two backbones: SlowOnly and 3D-ResNet (R3D) used in Spatiotemporal Contrastive Video Representation Learning.
  • Training and evaluation details:
    • All models are trained from scratch with the vision modality (RGB) for 200 epochs.
    • We use a batch size of 1024 and cosine learning rate decay with linear warmup over the first 5 epochs.
    • We follow SlowFast to perform 30-view evaluation.
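
As a rough illustration of multi-view evaluation, the sketch below averages per-view class probabilities into one video-level prediction. It assumes the 30 views are 10 temporal clips times 3 spatial crops, as in the SlowFast protocol; that breakdown is not spelled out in this document. The function names are hypothetical, and the per-view probabilities are taken as given rather than computed by a model.

```python
from typing import List, Sequence

# Assumption: 30 views = 10 temporal clips x 3 spatial crops per clip.
NUM_TEMPORAL_CLIPS = 10
NUM_SPATIAL_CROPS = 3


def average_views(view_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-view class probabilities into one video-level vector."""
    if not view_probs:
        raise ValueError("at least one view is required")
    num_views = len(view_probs)
    num_classes = len(view_probs[0])
    return [sum(v[c] for v in view_probs) / num_views for c in range(num_classes)]


def predict_video(view_probs: Sequence[Sequence[float]]) -> int:
    """Return the class index with the highest averaged probability."""
    avg = average_views(view_probs)
    return max(range(len(avg)), key=avg.__getitem__)
```

Averaging over views means a class that scores consistently across clips and crops can win even if a few individual views disagree.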

Kinetics-400 Action Recognition Baselines

| model | input (frame x stride) | Top-1 | Top-5 | download |
| --- | --- | --- | --- | --- |
| SlowOnly | 8 x 8 | 74.1 | 91.4 | config |
| SlowOnly | 16 x 4 | 75.6 | 92.1 | config |
| R3D-50 | 32 x 2 | 77.0 | 93.0 | config |
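
The "frame x stride" column can be read as the number of sampled frames and the temporal stride between them. The sketch below shows that reading with two hypothetical helpers; it assumes frames are taken at a fixed stride from a chosen start index, which is the usual convention but is not confirmed by this document.

```python
from typing import List


def clip_frame_indices(start: int, num_frames: int, stride: int) -> List[int]:
    """Indices of the frames sampled for one clip, beginning at `start`."""
    return [start + i * stride for i in range(num_frames)]


def clip_span(num_frames: int, stride: int) -> int:
    """Number of raw video frames covered from the first to the last sample."""
    return (num_frames - 1) * stride + 1
```

Under this reading, the three input settings in the table cover similar stretches of raw video (57, 61, and 63 frames) while sampling them at different densities.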

Kinetics-600 Action Recognition Baselines

| model | input (frame x stride) | Top-1 | Top-5 | download |
| --- | --- | --- | --- | --- |
| SlowOnly | 8 x 8 | 77.3 | 93.6 | config |
| R3D-50 | 32 x 2 | 79.5 | 94.8 | config |