This paper jointly resolves two problems in vision transformers: i) the computation of Multi-Head Self-Attention (MHSA) has high computational/space complexity; ii) recent vision transformer networks are overly tuned for image classification, ignoring the difference between image classification (simple scenarios, more similar to NLP) and downstream scene understanding tasks (complicated scenarios, rich structural and contextual information). To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful context abstraction, and its natural property of spatial invariance makes it suitable for addressing the loss of structural information (problem ii)). Hence, we propose to adapt pyramid pooling to MHSA to alleviate its high demand on computational resources (problem i)). In this way, the resulting pooling-based MHSA addresses both problems and is thus flexible and powerful for downstream scene understanding tasks. Equipped with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority over previous CNN- and transformer-based networks on various downstream scene understanding tasks such as semantic segmentation, object detection, instance segmentation, and visual saliency detection. The code will be released at https://github.com/yuhuan-wu/P2T. Note that this technical report will be kept updated.
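To make the core idea concrete, below is a minimal PyTorch sketch of a pooling-based multi-head self-attention: keys and values are computed from a pyramid of average-pooled feature maps, so attention cost scales with the short pooled token sequence rather than the full token sequence. This is only an illustrative sketch, not the released P2T implementation; the class name `PyramidPoolingAttention`, the `pool_ratios` values, and the omission of extras such as positional enhancement are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingAttention(nn.Module):
    """Minimal sketch of pooling-based MHSA (illustrative, not the official P2T code).

    Queries come from the full token sequence; keys/values come from a
    pyramid of average-pooled feature maps, reducing attention complexity
    from O(N^2) to O(N * M) with M << N.
    """

    def __init__(self, dim, num_heads=8, pool_ratios=(12, 16, 20, 24)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.pool_ratios = pool_ratios  # hypothetical default values
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Build the pyramid of pooled tokens from the 2D feature map.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = []
        for r in self.pool_ratios:
            p = F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
            pooled.append(p.flatten(2))                     # (B, C, h*w)
        pooled = torch.cat(pooled, dim=2).transpose(1, 2)   # (B, M, C), M << N

        # Keys and values are derived from the pooled (short) sequence.
        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                    # each (B, heads, M, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, N, M)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the pooled tokens summarize the feature map at multiple scales, this layer retains pyramid pooling's context abstraction and spatial structure while keeping the attention map small; the official implementation at the repository above should be consulted for the exact design.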