Without densely tiled anchor boxes or grid points in the image, Sparse R-CNN achieves promising results through a set of object queries and proposal boxes that are updated in a cascaded training manner. However, because of its sparse nature and the one-to-one relation between each query and its attending region, it depends heavily on self-attention, which is usually inaccurate in the early training stage. Moreover, in scenes with dense objects, an object query interacts with many irrelevant queries, which reduces its uniqueness and harms performance. This paper proposes using the IoU between different boxes as a prior for value routing in self-attention. The original attention matrix is multiplied element-wise by a matrix of the same size computed from the IoU of the proposal boxes, and together they determine the routing scheme so that irrelevant features are suppressed. Furthermore, to extract features accurately for both classification and regression, we add two lightweight projection heads that produce dynamic channel masks from the object query; the masks are multiplied with the output of the dynamic convolutions, making the results suitable for the two different tasks. We validate the proposed scheme on several datasets, including MS-COCO and CrowdHuman, and show that it significantly improves performance and speeds up model convergence.
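The IoU prior for value routing can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the queries serve directly as keys and values with no learned projections, boxes are assumed to be in `(x1, y1, x2, y2)` format, and the row renormalization after applying the IoU matrix is an assumption the abstract does not state.

```python
import math

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def iou_weighted_attention(queries, boxes):
    """Self-attention over object queries, gated by pairwise box IoU.

    Simplification (assumption): queries act as Q, K, and V directly.
    Returns the routing matrix and the attended features.
    """
    n, d = len(queries), len(queries[0])
    # Scaled dot-product scores followed by a row-wise softmax.
    scores = [[sum(x * y for x, y in zip(queries[i], queries[j])) / math.sqrt(d)
               for j in range(n)] for i in range(n)]
    attn = [softmax(row) for row in scores]
    # Same-size matrix computed from the IoU of the proposal boxes.
    iou = [[box_iou(boxes[i], boxes[j]) for j in range(n)] for i in range(n)]
    # Element-wise product: queries whose boxes do not overlap get weight 0.
    routed = [[attn[i][j] * iou[i][j] for j in range(n)] for i in range(n)]
    # Assumption: renormalize each row so the weights sum to 1 again.
    normed = []
    for row in routed:
        total = sum(row)
        normed.append([w / total for w in row] if total > 0 else row)
    out = [[sum(normed[i][j] * queries[j][k] for j in range(n))
            for k in range(d)] for i in range(n)]
    return normed, out

# Two overlapping boxes and one isolated box (illustrative values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (100, 100, 110, 110)]
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
routing, features = iou_weighted_attention(queries, boxes)
```

In this toy input the third box overlaps neither of the others, so its query attends only to itself, while the first two queries exchange information in proportion to both their attention scores and their box overlap.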