The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and/or pyramid designs remain key to their empirical success. In this paper, we show that this reliance on either feature pyramids or a hierarchical backbone is unnecessary: a transformer-based detector with scale-aware attention enables a plain detector, SimPLR, whose backbone and detection head are both non-hierarchical and operate on single-scale features. The plain architecture allows SimPLR to take advantage of self-supervised learning and scaling approaches developed for ViTs, yielding competitive performance compared to hierarchical and multi-scale counterparts. Our experiments demonstrate that, when scaling to larger ViT backbones, SimPLR achieves better performance than end-to-end segmentation models (Mask2Former) and plain-backbone detectors (ViTDet), while being consistently faster. The code will be released.