Although convolutional neural networks (CNNs) have achieved great success as backbones in computer vision, this work investigates a simple, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision Transformer (ViT), which is specially designed for image classification, we propose the Pyramid Vision Transformer~(PVT), which overcomes the difficulties of porting Transformers to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT, which typically has low-resolution outputs and high computational and memory costs, PVT can not only be trained on dense partitions of the image to achieve the high output resolution that is important for dense prediction, but also use a progressive shrinking pyramid to reduce the computation on large feature maps. (2) PVT inherits the advantages of both CNNs and Transformers, making it a unified, convolution-free backbone for various vision tasks that can simply replace CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection and semantic and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT can serve as an alternative and useful backbone for pixel-level prediction and facilitate future research. Code is available at https://github.com/whai362/PVT.
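To make the progressive shrinking pyramid concrete, below is a minimal PyTorch sketch of a four-stage backbone whose feature maps shrink stage by stage (strides 4, 8, 16, 32), as a CNN backbone's would. It is an illustration, not the authors' implementation: the stage dimensions and depths are assumed values, the patch embedding uses a strided convolution purely as the standard ViT-style patch projection (equivalent to a linear layer on flattened patches), and PyTorch's stock full-attention encoder stands in for PVT's spatial-reduction attention; see the repository above for the real model.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split a feature map into non-overlapping patches and project them."""
    def __init__(self, in_ch, embed_dim, patch_size):
        super().__init__()
        # Strided conv == linear projection of flattened patches (ViT-style).
        self.proj = nn.Conv2d(in_ch, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                                 # B, C, H/ps, W/ps
        B, C, H, W = x.shape
        return x.flatten(2).transpose(1, 2), (H, W)      # B, N, C tokens


class PyramidBackbone(nn.Module):
    """Four stages with progressively shrinking spatial resolution.
    dims/depths are illustrative assumptions, not PVT's hyper-parameters."""
    def __init__(self, dims=(64, 128, 320, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for i, dim in enumerate(dims):
            patch = 4 if i == 0 else 2   # 4x shrink first, then 2x per stage
            embed = PatchEmbed(in_ch, dim, patch)
            # Full attention here; PVT itself uses spatial-reduction attention.
            encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True),
                num_layers=depths[i])
            self.stages.append(nn.ModuleList([embed, encoder]))
            in_ch = dim

    def forward(self, x):
        feats = []  # multi-scale maps usable by dense prediction heads
        for embed, encoder in self.stages:
            tokens, (H, W) = embed(x)
            tokens = encoder(tokens)
            x = tokens.transpose(1, 2).reshape(x.shape[0], -1, H, W)
            feats.append(x)              # strides 4, 8, 16, 32
        return feats


feats = PyramidBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])  # resolutions shrink, like FPN inputs
```

The multi-scale outputs are what lets such a backbone drop into detectors like RetinaNet in place of a ResNet, while the coarser token grids in later stages keep attention over large feature maps affordable.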