October 2020
tl;dr: Improved DETR that trains faster and performs better on small objects.
Issues with DETR: it needs many training epochs to converge and performs poorly at detecting small objects. DETR uses a small-size feature map to save computation, which hurts small object detection.
Deformable DETR first reduces computation by attending to only a small set of key sampling points around a reference point. It then uses a multi-scale deformable attention module to aggregate multi-scale features (without FPN) to help small object detection.
Each object query is restricted to attend to a small set of key sampling points around its reference point instead of all points in the feature map.
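Below is a minimal single-scale, single-head sketch of this idea in PyTorch (not the authors' released code): each query predicts a few sampling offsets and attention weights from its own features, and values are bilinearly sampled at reference point + offset. The module name, the `num_points` default, and the offset scaling are illustrative assumptions; the real model uses multiple heads, multiple feature levels, and normalizes offsets by the feature map shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch: each query attends to only K sampled
    points around its reference point instead of all H*W positions."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        # Sampling offsets and attention weights are predicted from the query alone.
        self.offset_proj = nn.Linear(dim, num_points * 2)
        self.weight_proj = nn.Linear(dim, num_points)
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, feat):
        # query:      (B, N, C)    object queries (or encoder pixel queries)
        # ref_points: (B, N, 2)    reference points, normalized to [0, 1]
        # feat:       (B, C, H, W) feature map
        B, N, _ = query.shape
        K = self.num_points

        value = self.value_proj(feat)                                  # (B, C, H, W)
        offsets = self.offset_proj(query).reshape(B, N, K, 2)
        weights = self.weight_proj(query).reshape(B, N, K).softmax(-1)

        # Sampling locations = reference point + predicted offset.
        # The 0.1 scale is an arbitrary choice for this sketch; the paper
        # instead normalizes offsets by the feature map shape.
        locs = ref_points[:, :, None, :] + 0.1 * offsets               # (B, N, K, 2)
        grid = 2.0 * locs - 1.0                                        # [0, 1] -> [-1, 1]

        # Bilinearly sample K values per query, then take a weighted sum.
        sampled = F.grid_sample(value, grid, align_corners=False)      # (B, C, N, K)
        out = (sampled * weights[:, None, :, :]).sum(dim=-1)           # (B, C, N)
        return self.out_proj(out.transpose(1, 2))                      # (B, N, C)


# Tiny smoke test: 100 queries against a 32x32 feature map.
attn = SimpleDeformableAttention(dim=256, num_points=4)
out = attn(torch.randn(2, 100, 256), torch.rand(2, 100, 2), torch.randn(2, 256, 32, 32))
print(out.shape)  # torch.Size([2, 100, 256])
```

Since each query touches only K points rather than all H*W positions, the cost no longer grows quadratically with the feature map size, which is what makes higher-resolution (multi-scale) features affordable.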
Deformable DETR is one of the highest-scored papers at ICLR 2021.
- There are several papers on improving the training speed of DETR:
  - Deformable DETR: sparse attention
  - TSP: sparse attention
  - Sparse RCNN: sparse proposals and iterative refinement
- Efficient attention methods fall into three categories:
  - pre-defined sparse attention patterns
  - learned data-dependent sparse attention --> Deformable DETR belongs to this category
  - low-rank property of self-attention
- Complexity of DETR
  - Encoder self-attention: $O(H^2W^2C)$, quadratic in the feature map size $HW$
  - Decoder cross-attention: $O(HWC^2 + NHWC)$, linear in the feature map size; decoder self-attention: $O(2NC^2 + N^2C)$ (see the numeric sketch below)
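For intuition, here is a back-of-the-envelope comparison of these terms with illustrative values (C=256 channels, N=100 object queries, a 32x32 feature map); the numbers are only meant to show that the encoder self-attention term dominates and scales worst as the feature map grows, which is why vanilla DETR avoids high-resolution maps.

```python
# Back-of-the-envelope comparison of DETR's attention complexity terms.
# Illustrative values only: C=256 channels, N=100 queries, 32x32 feature map.
H, W, C, N = 32, 32, 256, 100

encoder_self_attn = (H * W) ** 2 * C                   # O(H^2 W^2 C)
decoder_cross_attn = H * W * C ** 2 + N * H * W * C    # O(HWC^2 + NHWC)
decoder_self_attn = 2 * N * C ** 2 + N ** 2 * C        # O(2NC^2 + N^2C)

print(f"encoder self-attention : {encoder_self_attn:.1e}")   # ~2.7e+08
print(f"decoder cross-attention: {decoder_cross_attn:.1e}")  # ~9.3e+07
print(f"decoder self-attention : {decoder_self_attn:.1e}")   # ~1.6e+07

# Doubling the feature map resolution (64x64) multiplies the encoder term by 16,
# while the decoder terms grow only by roughly 4x.
```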
- Summary of technical details
- Questions and notes on how to improve/revise the current work