Transformers have been widely adopted for numerous vision problems, especially visual recognition and detection. Detection transformers were the first fully end-to-end learning systems for object detection, while vision transformers were the first fully transformer-based architectures for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer into a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential for boosting detection performance without a significant increase in computational load. In addition, we extend it to ViDT+ to support joint-task learning for object detection and instance segmentation. Specifically, we attach an efficient multi-scale feature fusion layer and adopt two additional auxiliary training losses, an IoU-aware loss and a token labeling loss. Extensive evaluation on the Microsoft COCO benchmark demonstrates that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and its extension ViDT+ achieves 53.2 AP owing to its high scalability for large models. The source code and trained models are available at https://github.com/naver-ai/vidt.