Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module that extends the recent Swin Transformer into a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential for boosting detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt.