Transformers have recently shown encouraging progress in computer vision. In this work, we improve the original Pyramid Vision Transformer (PVTv1) with three designs: (1) locally continuous features via convolutions, (2) position encodings via zero padding, and (3) linear-complexity attention layers via average pooling. With these simple modifications, our PVTv2 significantly improves over PVTv1 on classification, detection, and segmentation. Moreover, PVTv2 achieves much better performance than recent works, including Swin Transformer, under ImageNet-1K pre-training. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at https://github.com/whai362/PVT .
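To illustrate design (3), linear-complexity attention via average pooling, here is a minimal single-head NumPy sketch. It is an illustration of the idea only, not the actual PVTv2 implementation: keys and values are average-pooled down to a fixed `pool_size x pool_size` grid, so the attention cost scales as O(n · pool_size²) in the number of tokens n rather than O(n²). All names and the `pool_size` default are assumptions for this sketch.

```python
import numpy as np

def avg_pool_tokens(x, pool_size):
    """Average-pool a token sequence (n, d), viewed as a square feature map,
    down to (pool_size * pool_size, d) tokens."""
    n, d = x.shape
    h = w = int(np.sqrt(n))
    fmap = x.reshape(h, w, d)
    # Bin the spatial grid into pool_size x pool_size cells; average each cell.
    rows = np.array_split(np.arange(h), pool_size)
    cols = np.array_split(np.arange(w), pool_size)
    pooled = np.stack([fmap[np.ix_(r, c)].mean(axis=(0, 1))
                       for r in rows for c in cols])
    return pooled  # shape: (pool_size**2, d)

def pooled_attention(x, pool_size=7):
    """Single-head attention where keys/values are average-pooled,
    so the score matrix is (n, pool_size**2) instead of (n, n)."""
    q = x                                # queries keep full resolution
    kv = avg_pool_tokens(x, pool_size)   # fixed-size keys/values
    d = x.shape[-1]
    scores = q @ kv.T / np.sqrt(d)       # (n, pool_size**2)
    # Numerically stable softmax over the pooled tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv                  # (n, d)

# e.g. tokens from a 56x56 feature map with 64 channels
x = np.random.randn(56 * 56, 64)
out = pooled_attention(x, pool_size=7)
```

Because the pooled key/value set has constant size, memory and compute grow only linearly with input resolution, which is what makes the attention practical on the high-resolution early stages of the pyramid.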