In this paper, we propose a novel query design for transformer-based object detection. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, a learned embedding has no explicit physical meaning, so we cannot explain where it will focus. It is also difficult to optimize, because the prediction slot of each object query does not have a specific mode; in other words, each object query does not focus on a specific region. To solve these problems, our object queries are based on anchor points, which are widely used in CNN-based detectors, so each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position, which addresses the "one region, multiple objects" difficulty. In addition, we design an attention variant that reduces the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, achieves better performance and runs faster than DETR with 10$\times$ fewer training epochs. For example, it achieves 44.2 AP at 19 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature and training for 50 epochs. Extensive experiments on the MSCOCO benchmark demonstrate the effectiveness of the proposed methods. Code is available at \url{https://github.com/megvii-research/AnchorDETR}.
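To make the query design concrete, the following is a minimal sketch of how anchor-based queries could be assembled. It is not the released implementation. The uniform grid of anchor points, the sine encoding, the additive combination, and the names `grid_anchor_points`, `sine_encoding`, and `build_queries` are all assumptions made for illustration. The one idea taken from the text above is that several queries ("patterns" here) share one anchor point, so more than one object can be predicted at the same position.

```python
import math
import random


def grid_anchor_points(rows, cols):
    """Return (x, y) anchor points at the cell centers of a rows x cols grid, in [0, 1]."""
    return [((c + 0.5) / cols, (r + 0.5) / rows)
            for r in range(rows) for c in range(cols)]


def sine_encoding(point, dim, temperature=10000.0):
    """Encode a 2D point as a dim-length vector (assumed encoding; dim divisible by 4)."""
    assert dim % 4 == 0
    per_axis = dim // 2
    out = []
    for coord in point:
        for i in range(per_axis // 2):
            freq = temperature ** (2 * i / per_axis)
            angle = 2 * math.pi * coord / freq
            out.extend([math.sin(angle), math.cos(angle)])
    return out


def build_queries(anchors, patterns, dim):
    """Build one query per (pattern, anchor) pair: pattern embedding + anchor encoding."""
    queries = []
    for p_idx, pattern in enumerate(patterns):
        for anchor in anchors:
            pos = sine_encoding(anchor, dim)
            vec = [p + q for p, q in zip(pattern, pos)]
            queries.append({"pattern": p_idx, "anchor": anchor, "vector": vec})
    return queries


random.seed(0)
DIM = 8            # hypothetical embedding size, kept tiny for readability
NUM_PATTERNS = 3   # hypothetical number of predictions allowed per anchor point

anchors = grid_anchor_points(2, 3)
# Random stand-ins for pattern embeddings that would be learned in a real model.
patterns = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATTERNS)]
queries = build_queries(anchors, patterns, DIM)

print(len(queries), "queries for", len(anchors), "anchor points")
```

Under these assumptions, every query carries an explicit position (its anchor point), and the number of queries is the number of patterns times the number of anchor points.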