Modern object detectors have taken advantage of pre-trained vision transformers by using them as backbone networks. However, aside from the backbone, other detector components, such as the detector head and the feature pyramid network, remain randomly initialized, which hinders the consistency between detectors and pre-trained models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) for object detection, constructing a feature extraction-operation path that is not only "fully pre-trained" but also consistent with pre-trained models. The essential improvements of imTED over existing transformer-based detectors are twofold: (1) it embeds the pre-trained transformer decoder into the detector head; and (2) it removes the feature pyramid network from the feature extraction path. These improvements significantly reduce the proportion of randomly initialized parameters and enhance the generalization capability of detectors. Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by ~2.8% AP. Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP, demonstrating significantly higher generalization capability. Code will be made publicly available.
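The claim that imTED reduces the proportion of randomly initialized parameters can be illustrated with a toy comparison. The sketch below is a minimal illustration, not the paper's implementation; all component names and parameter counts are hypothetical assumptions, chosen only to contrast a conventional backbone-plus-FPN-plus-head pipeline with an imTED-style pipeline in which the pre-trained decoder serves as the detector head and the FPN is removed:

```python
# Toy comparison of pre-trained vs. randomly initialized parameters.
# Component names and counts below are illustrative assumptions,
# not figures taken from the imTED paper.

def pretrained_fraction(components):
    """Fraction of total parameters that come from a pre-trained model.

    components: dict mapping name -> (param_count, is_pretrained)
    """
    total = sum(n for n, _ in components.values())
    pretrained = sum(n for n, is_pre in components.values() if is_pre)
    return pretrained / total

# Conventional detector: only the backbone is pre-trained;
# the FPN and detector head are randomly initialized.
conventional = {
    "vit_backbone": (86_000_000, True),
    "fpn":          (3_000_000, False),
    "detector_head": (14_000_000, False),
}

# imTED-style detector: the pre-trained decoder replaces the random
# head, and the FPN is removed from the feature extraction path.
imted = {
    "vit_encoder":      (86_000_000, True),
    "vit_decoder_head": (14_000_000, True),
}

print(f"conventional: {pretrained_fraction(conventional):.0%} pre-trained")
print(f"imTED:        {pretrained_fraction(imted):.0%} pre-trained")
```

Under these assumed counts, the conventional pipeline is only partially pre-trained, while the imTED-style path is fully pre-trained, which is the "fully pre-trained" property the abstract refers to.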