Transformers have achieved great success in computer vision, yet how to split an image into patches remains an open problem. Existing methods usually use fixed-size patch embedding, which might destroy the semantics of objects. To address this problem, we propose a new Deformable Patch (DePatch) module that learns to adaptively split images into patches with different positions and scales in a data-driven way, rather than using predefined fixed patches. In this way, our method can well preserve the semantics in patches. The DePatch module works as a plug-and-play module that can easily be incorporated into different transformers for end-to-end training. We term this DePatch-embedded transformer the Deformable Patch-based Transformer (DPT) and conduct extensive evaluations of DPT on image classification and object detection. Results show that DPT achieves 81.9% top-1 accuracy on ImageNet classification, and 43.7% box mAP with RetinaNet and 44.3% with Mask R-CNN on MSCOCO object detection. Code has been made available at: https://github.com/CASIA-IVA-Lab/DPT .
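To make the core idea concrete, the following is a minimal NumPy sketch of deformable patch embedding, not the authors' implementation: for each nominal patch location, a predicted offset shifts the patch center and a predicted scale resizes its extent, and the deformed box is then bilinearly sampled on a small grid to form one patch token. The function names, the k×k sampling grid, and the optional projection matrix `W_proj` are illustrative assumptions; in the paper the offsets and scales would be predicted by a learned sub-network rather than passed in.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Bilinearly sample img of shape (H, W, C) at float coords (ys, xs)."""
    H, W, _ = img.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)[..., None]  # fractional weights
    wx = np.clip(xs - x0, 0.0, 1.0)[..., None]
    top = img[y0, x0] * (1 - wx) + img[y0, x0 + 1] * wx
    bot = img[y0 + 1, x0] * (1 - wx) + img[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy

def deformable_patch_embed(img, centers, offsets, scales, k=2, W_proj=None):
    """For each nominal patch center, shift by a (dy, dx) offset and
    resize to an (sh, sw) box, then sample a k x k grid of points inside
    the deformed box and flatten the samples into one patch token."""
    tokens = []
    for (cy, cx), (dy, dx), (sh, sw) in zip(centers, offsets, scales):
        gy = np.linspace(cy + dy - sh / 2, cy + dy + sh / 2, k)
        gx = np.linspace(cx + dx - sw / 2, cx + dx + sw / 2, k)
        ys, xs = np.meshgrid(gy, gx, indexing="ij")
        tokens.append(bilinear_sample(img, ys, xs).reshape(-1))
    tokens = np.stack(tokens)  # (num_patches, k * k * C)
    # Optional linear projection to the transformer embedding dimension.
    return tokens if W_proj is None else tokens @ W_proj
```

Because the sampling is bilinear, the token values are differentiable with respect to the offsets and scales, which is what allows the patch positions and sizes to be learned end-to-end alongside the rest of the transformer.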