This paper presents a DETR-based method for cross-domain weakly supervised object detection (CDWSOD), which aims to adapt a detector from a source to a target domain through weak supervision. We argue that DETR has strong potential for CDWSOD due to an insight: the encoder and the decoder in DETR are both based on the attention mechanism and are thus capable of aggregating semantics across the entire image. The aggregation results, i.e., image-level predictions, can naturally exploit the weak supervision for domain alignment. Thus motivated, we propose DETR with additional Global Aggregation (DETR-GA), a CDWSOD detector that simultaneously makes "instance-level + image-level" predictions and utilizes "strong + weak" supervision. The key point of DETR-GA is very simple: for the encoder / decoder, we respectively add multiple class queries / a foreground query to aggregate the semantics into image-level predictions. Our query-based aggregation has two advantages. First, in the encoder, the weakly-supervised class queries are capable of roughly locating the corresponding positions and excluding distraction from non-relevant regions. Second, through our design, the object queries and the foreground query in the decoder share consensus on the class semantics, so the strong and weak supervision mutually benefit each other for domain alignment. Extensive experiments on four popular cross-domain benchmarks show that DETR-GA significantly improves CDWSOD and advances the state of the art (e.g., 29.0% --> 79.4% mAP on PASCAL VOC --> Clipart_all).
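The encoder-side aggregation described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the paper's actual implementation: the names `class_queries`, `encoder_tokens`, and `w_cls` are assumptions. Each learnable class query attends over all flattened encoder tokens via softmax attention, so it can concentrate on regions relevant to its class, and the aggregated feature is scored into one image-level logit per class, which weak (image-level) labels can supervise.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_aggregation(class_queries, encoder_tokens, w_cls):
    """Aggregate encoder semantics into image-level class predictions.

    class_queries : (C, d)  one learnable query per class (hypothetical)
    encoder_tokens: (N, d)  flattened encoder feature tokens
    w_cls         : (C, d)  per-class scoring weights (hypothetical)

    Returns (C,) image-level class logits, trainable with weak
    (image-level) labels on the target domain.
    """
    d = class_queries.shape[1]
    # Each class query attends over the whole image, roughly locating
    # its class and down-weighting non-relevant regions.
    attn = softmax(class_queries @ encoder_tokens.T / np.sqrt(d))  # (C, N)
    aggregated = attn @ encoder_tokens                             # (C, d)
    logits = (aggregated * w_cls).sum(axis=1)                      # (C,)
    return logits

rng = np.random.default_rng(0)
C, N, d = 20, 64, 32  # e.g., 20 PASCAL VOC classes
logits = global_aggregation(rng.normal(size=(C, d)),
                            rng.normal(size=(N, d)),
                            rng.normal(size=(C, d)))
print(logits.shape)  # (20,)
```

The decoder-side foreground query would aggregate analogously, but with a single query producing class-wise foreground scores shared with the object queries.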