Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison. However, due to the extremely limited guidance for the novel class, as well as the unseen appearance differences between query and target instances, it is difficult to appropriately exploit their semantic similarity and generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture bi-directional correspondence between any pair of pixels from the query patch and the target image, which empowers us to sufficiently exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on COCO, VOC, and FSOD under one-shot settings demonstrate the effectiveness and efficiency of our method: it surpasses CoAE, a major baseline in this task, by 1.0% AP on COCO and runs nearly 2.5 times faster. Code will be released in the future.
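The bi-directional correspondence described above can be illustrated with a minimal NumPy sketch of scaled dot-product cross-attention between flattened query-patch and target-image pixel features. This is a simplified assumption of the mechanism, not the paper's actual CAT module: the learned projection matrices, multi-head structure, and dimensionality compression are omitted, and all tensor names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, y):
    """Each row (pixel feature) of x attends over all rows of y.

    Simplified scaled dot-product attention where y serves as both
    keys and values (no learned projections, for illustration only).
    """
    d = y.shape[-1]
    attn = softmax(x @ y.T / np.sqrt(d), axis=-1)  # (Nx, Ny), rows sum to 1
    return attn @ y                                # (Nx, d) aggregated features

# Toy sizes: Nq query-patch pixels, Nt target-image pixels, d channels.
Nq, Nt, d = 16, 64, 32
rng = np.random.default_rng(0)
q = rng.standard_normal((Nq, d))  # flattened query-patch features
t = rng.standard_normal((Nt, d))  # flattened target-image features

# Bi-directional correspondence: each direction enriches one side
# with semantics gathered from every pixel of the other side.
t_aug = attend(t, q)  # target pixels attend over the query patch
q_aug = attend(q, t)  # query pixels attend over the target image
```

Computing both directions gives every pixel pair a chance to exchange information, which is the property the abstract credits for the accurate similarity comparison.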