Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (abbreviated as PVTv1) with three designs: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear-complexity attention layers. With these modifications, our PVTv2 significantly improves over PVTv1 on three fundamental tasks: classification, detection, and segmentation. Moreover, PVTv2 achieves performance comparable to or better than recent works such as the Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT .
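As a minimal illustration of the first design, overlapping patch embedding can be realized as a zero-padded strided convolution whose window is wider than its stride, so neighboring patches share pixels. The sketch below only computes the resulting token-grid size; the kernel/padding choice (kernel = 2·stride − 1, padding = stride − 1) is an assumption for illustration, not a claim about the paper's exact configuration.

```python
# Sketch: overlapping patch embedding viewed as a strided convolution.
# Assumed (illustrative) choice: kernel k = 2*s - 1, padding p = s - 1 for
# stride s, so adjacent windows overlap while spatial size shrinks by s.

def conv_out_size(n, k, s, p):
    """Output length of a 1-D convolution over n inputs (standard formula)."""
    return (n + 2 * p - k) // s + 1

def overlapping_patch_grid(h, w, stride):
    """Token-grid size when embedding an h x w image with overlapping patches."""
    k, p = 2 * stride - 1, stride - 1  # overlapping window, 'same'-style padding
    return conv_out_size(h, k, stride, p), conv_out_size(w, k, stride, p)

# A 224 x 224 image at stride 4 yields a 56 x 56 token grid -- the same grid
# as non-overlapping 4 x 4 patches, but each token now sees a 7 x 7 window.
print(overlapping_patch_grid(224, 224, 4))  # → (56, 56)
```

Because the grid size matches the non-overlapping case, this drop-in change preserves the pyramid's downsampling ratios while letting tokens share local context.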