In this paper, we propose a novel query design for transformer-based detectors. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding has no explicit physical meaning, and we cannot explain where it will focus. The queries are also difficult to optimize, as the prediction slot of each object query has no specific mode; in other words, each object query does not focus on a specific region. To solve these problems, our query design bases the object queries on anchor points, which are widely used in CNN-based detectors, so that each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position, addressing the "one region, multiple objects" difficulty. In addition, we design an attention variant that reduces the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, achieves better performance and runs faster than DETR with 10$\times$ fewer training epochs. For example, it achieves 44.2 AP at 16 FPS on the MSCOCO dataset when trained for 50 epochs with the ResNet50-DC5 feature. Extensive experiments on the MSCOCO benchmark demonstrate the effectiveness of the proposed methods. Code is available at https://github.com/megvii-model/AnchorDETR.
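To make the anchor-based query design concrete, below is a minimal PyTorch sketch, not the authors' implementation; the names (`AnchorQueries`, `num_anchors`, `num_patterns`) and the exact embedding details are illustrative assumptions. It shows the general idea of encoding 2-D anchor points into query embeddings and attaching several pattern embeddings to each anchor so that one position can predict multiple objects.

```python
import math
import torch
import torch.nn as nn


def sine_embed(pos, dim=256, temperature=10000.0):
    # Encode normalized 2-D points in [0, 1] into sine-cosine embeddings,
    # half of the channels for x and half for y (a common DETR-style encoding).
    scale = 2 * math.pi
    dim_t = torch.arange(dim // 2, dtype=torch.float32, device=pos.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / (dim // 2))
    x = pos[..., 0:1] * scale / dim_t
    y = pos[..., 1:2] * scale / dim_t
    x = torch.stack((x[..., 0::2].sin(), x[..., 1::2].cos()), dim=-1).flatten(-2)
    y = torch.stack((y[..., 0::2].sin(), y[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat((y, x), dim=-1)


class AnchorQueries(nn.Module):
    """Object queries built from anchor points plus pattern embeddings (illustrative)."""

    def __init__(self, num_anchors=300, num_patterns=3, dim=256):
        super().__init__()
        # Learnable 2-D anchor points in [0, 1]; a fixed grid would also work.
        self.anchors = nn.Parameter(torch.rand(num_anchors, 2))
        # Pattern embeddings let a single anchor predict several objects.
        self.patterns = nn.Embedding(num_patterns, dim)

    def forward(self):
        pos = sine_embed(self.anchors, dim=self.patterns.embedding_dim)   # (A, dim)
        # Combine every pattern with every anchor: (num_patterns * A, dim).
        q = self.patterns.weight[:, None, :] + pos[None, :, :]
        return q.flatten(0, 1)


queries = AnchorQueries()()
print(queries.shape)  # torch.Size([900, 256]) -> 300 anchors x 3 patterns
```

In this sketch, each query is tied to an explicit anchor position, so its prediction naturally concentrates on objects near that point, and the pattern embeddings provide multiple prediction slots per position.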