Recently, the vision transformer has achieved great success by advancing the state of the art on various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority over previous CNN- and transformer-based networks in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. The code will be released at https://github.com/yuhuan-wu/P2T.
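The core idea of pooling-based MHSA can be sketched as follows: the query comes from the full token sequence, while keys and values come from a much shorter sequence obtained by pooling the token map at several ratios and concatenating the results. The sketch below is a minimal single-head NumPy illustration under assumed pooling ratios; it omits the learned projections, multiple heads, and other components of the actual P2T design.

```python
import numpy as np

def avg_pool2d(x, ratio):
    # Non-overlapping average pooling on a (H, W, C) token map with stride = ratio.
    H, W, C = x.shape
    Hp, Wp = H // ratio, W // ratio
    x = x[:Hp * ratio, :Wp * ratio].reshape(Hp, ratio, Wp, ratio, C)
    return x.mean(axis=(1, 3))

def pyramid_pool_tokens(x, ratios):
    # Pool the token map at several ratios and concatenate the flattened
    # results into one short token sequence (the "pyramid" of contexts).
    pooled = [avg_pool2d(x, r).reshape(-1, x.shape[-1]) for r in ratios]
    return np.concatenate(pooled, axis=0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pooling_based_attention(x, ratios=(2, 4, 8)):
    # x: (H, W, C) token map. Queries use all H*W tokens; keys/values use
    # the pyramid-pooled sequence, so attention cost drops from
    # O((H*W)^2) to O(H*W * P) with P << H*W. The ratios here are
    # illustrative, not the paper's exact configuration.
    H, W, C = x.shape
    q = x.reshape(H * W, C)
    kv = pyramid_pool_tokens(x, ratios)           # (P, C)
    attn = softmax(q @ kv.T / np.sqrt(C))         # (H*W, P)
    return attn @ kv                              # (H*W, C)
```

For a 16x16 token map with ratios (2, 4, 8), the key/value sequence has only 64 + 16 + 4 = 84 tokens instead of 256, while the output keeps the full 256-token resolution.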