Transformers have recently demonstrated encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVTv1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVTv2 reduces the computational complexity of PVTv1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVTv2 achieves performance comparable to or better than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
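To make the three designs concrete, below is a minimal PyTorch sketch of each component. The hyper-parameters (7x7 pooling target, 7x7/stride-4 embedding kernel, 3x3 depth-wise convolution) follow the paper's description, but the class names, default dimensions, and simplified module structure are illustrative assumptions, not the official implementation in the linked repository.

```python
# Minimal sketch of the three PVTv2 designs; names and defaults are illustrative.
import torch
import torch.nn as nn


class OverlappingPatchEmbed(nn.Module):
    """(2) Overlapping patch embedding: a strided convolution whose kernel
    is larger than its stride, so neighboring patches share pixels."""
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, H'*W', D) token sequence
        return self.norm(x), H, W


class LinearSRA(nn.Module):
    """(1) Linear-complexity attention: keys/values are average-pooled to a
    fixed 7x7 spatial size, so attention cost grows linearly with input size."""
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.act = nn.GELU()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                 # x: (B, N, D), N = H*W
        B, N, D = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, D // self.num_heads).transpose(1, 2)
        # Pool the spatial map to a constant size before computing K and V.
        x_ = x.transpose(1, 2).reshape(B, D, H, W)
        x_ = self.pool(x_).flatten(2).transpose(1, 2)   # (B, 49, D)
        x_ = self.act(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, D // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * (D // self.num_heads) ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


class ConvFFN(nn.Module):
    """(3) Convolutional feed-forward network: a 3x3 depth-wise convolution
    between the two linear layers injects local positional information."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                 # x: (B, N, D)
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    embed = OverlappingPatchEmbed()
    tokens, H, W = embed(img)                   # (1, 56*56, 64)
    tokens = tokens + LinearSRA()(tokens, H, W)
    tokens = tokens + ConvFFN()(tokens, H, W)
    print(tokens.shape)                         # torch.Size([1, 3136, 64])
```

In this sketch, pooling the key/value map to a fixed 7x7 size caps the attention matrix at N x 49 regardless of input resolution, which is what makes the cost linear; the depth-wise convolution in the FFN supplies local position cues, which is why PVTv2 can operate without fixed-size positional encodings.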