Transformers have recently demonstrated encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) an overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves performance comparable to or better than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
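To make the three designs concrete, the sketch below shows minimal, illustrative PyTorch modules: an overlapping patch embedding (a strided convolution whose kernel is larger than its stride, so neighbouring patches share pixels), a convolutional feed-forward network (a depthwise 3x3 convolution inserted between the two linear layers), and a linear-complexity attention layer (keys and values pooled to a fixed spatial size). The module names, hyper-parameters (patch_size=7, stride=4, pool_size=7, etc.), and exact layer ordering are assumptions for illustration, not the repository's implementation.

```python
import torch
import torch.nn as nn


class OverlappingPatchEmbed(nn.Module):
    """Overlapping patch embedding: kernel > stride, so patches overlap
    (illustrative hyper-parameters; the paper's exact settings may differ)."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, H'*W', D) token sequence
        return self.norm(x), H, W


class ConvFFN(nn.Module):
    """Feed-forward network with a depthwise 3x3 conv between the two
    linear layers, injecting local spatial information (sketch)."""
    def __init__(self, dim=64, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1,
                                groups=hidden_dim)  # depthwise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):                # x: (B, N, dim), N = H*W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        x = self.act(x)
        return self.fc2(x)


class LinearSRAttention(nn.Module):
    """Attention whose keys/values are average-pooled to a fixed spatial
    size, so cost grows linearly with the number of query tokens (sketch)."""
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                # x: (B, N, dim), N = H*W
        B, N, C = x.shape
        head_dim = C // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, head_dim).transpose(1, 2)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.pool(x_).flatten(2).transpose(1, 2)   # fixed-length K/V tokens
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    tokens, H, W = OverlappingPatchEmbed()(img)         # (1, 56*56, 64)
    tokens = tokens + LinearSRAttention()(tokens, H, W)
    tokens = tokens + ConvFFN()(tokens, H, W)
    print(tokens.shape)                                  # torch.Size([1, 3136, 64])
```

Because the keys and values are pooled to a fixed number of tokens (here 7x7 = 49), the attention matrix is N x 49 rather than N x N, which is the sense in which the attention cost becomes linear in the number of input tokens.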