Recent self-supervised methods are mainly designed for representation learning with base models such as ResNets or ViTs. They cannot be easily transferred to DETR, which contains task-specific Transformer modules. In this work, we present Siamese DETR, a Siamese self-supervised pretraining approach for the Transformer architecture in DETR. We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks, i.e., localization and discrimination, in a novel multi-view learning framework. Two self-supervised pretext tasks are designed: (i) Multi-View Region Detection learns to localize regions of interest consistently across augmented views of the input, and (ii) Multi-View Semantic Discrimination improves object-level discrimination for each region. The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection with different DETR variants in all setups. Code is available at https://github.com/Zx55/SiameseDETR.
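Conceptually, the two pretext tasks pair predictions from two augmented views: a localization term that asks the boxes predicted for the same region to agree across views, and a discrimination term that contrasts region embeddings between views. The following is a minimal NumPy sketch of such paired objectives; the function name, the L1 form of the localization term, and the InfoNCE form of the discrimination term are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def siamese_losses(boxes_v1, boxes_v2, feats_v1, feats_v2, tau=0.1):
    """Sketch of paired multi-view objectives (illustrative, not the paper's code).

    boxes_v*: (N, 4) predicted boxes for N shared regions in each view.
    feats_v*: (N, D) region embeddings from each view.
    """
    # Multi-View Region Detection (sketch): L1 agreement between the
    # boxes predicted for corresponding regions in the two views.
    loc = np.abs(boxes_v1 - boxes_v2).mean()

    # Multi-View Semantic Discrimination (sketch): InfoNCE over
    # L2-normalized region embeddings; region i in view 1 should match
    # region i in view 2 and repel all other regions.
    z1 = feats_v1 / np.linalg.norm(feats_v1, axis=1, keepdims=True)
    z2 = feats_v2 / np.linalg.norm(feats_v2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    disc = -np.mean(np.diag(log_probs))  # cross-entropy on the diagonal
    return loc, disc
```

With identical views the localization term vanishes and the discrimination term approaches zero as corresponding embeddings separate from the rest; in training the two terms would be summed (possibly with weights) into a single pretraining loss.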