Recently, the vision transformer has achieved great success by pushing the state of the art on various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation appears less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its strong ability in context abstraction. However, pyramid pooling has not yet been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority over previous CNN- and transformer-based networks on various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. The code will be released at https://github.com/yuhuan-wu/P2T.
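To make the core idea concrete, below is a minimal PyTorch sketch of an MHSA layer whose keys and values are computed from pyramid-pooled tokens, so that the N image tokens attend to a much shorter pooled sequence. The class name `PyramidPoolingMHSA`, the pool ratios, and the layer shapes are illustrative assumptions for exposition, not the exact P2T implementation (see the released code for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingMHSA(nn.Module):
    """Sketch of MHSA with pyramid-pooled keys/values.

    Queries come from all N = H*W image tokens; keys/values come from a
    short sequence of tokens pooled at several ratios, which both reduces
    the attention cost from O(N^2) to O(N*M) with M << N and injects
    multi-scale contextual features. Pool ratios here are assumptions.
    """

    def __init__(self, dim, num_heads=8, pool_ratios=(12, 16, 20, 24)):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.pool_ratios = pool_ratios
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N == H * W image tokens.
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Pyramid pooling: average-pool the feature map at several ratios
        # and concatenate the pooled maps into one short token sequence.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = []
        for r in self.pool_ratios:
            p = F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
            pooled.append(p.flatten(2).transpose(1, 2))  # (B, h*w, C)
        tokens = self.norm(torch.cat(pooled, dim=1))      # (B, M, C), M << N

        kv = self.kv(tokens).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                  # each (B, heads, M, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B, heads, N, M)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: 3136 image tokens (56x56) attend to only 33 pooled tokens
# (4*4 + 3*3 + 2*2 + 2*2) under the pool ratios assumed above.
layer = PyramidPoolingMHSA(dim=64)
y = layer(torch.randn(2, 56 * 56, 64), H=56, W=56)
```

Note the design choice this sketch highlights: unlike a single pooling operation, concatenating features pooled at multiple ratios lets each query token attend over several context scales at once, while the key/value sequence stays short.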