The rise of transformers in vision tasks not only advances network backbone designs, but also opens a new chapter for end-to-end image recognition (e.g., object detection and panoptic segmentation). Originating from Natural Language Processing (NLP), transformer architectures, consisting of self-attention and cross-attention, effectively learn long-range interactions between elements in a sequence. However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features. This subsequently impedes the learning of cross-attention between pixel features and object queries. In this paper, we rethink the relationship between pixels and object queries, and propose to reformulate cross-attention learning as a clustering process. Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only improves the state of the art but also enjoys a simple and elegant design. As a result, our kMaX-DeepLab achieves new state-of-the-art performance on the COCO val set with 58.0% PQ, and on the Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU, without test-time augmentation or an external dataset. We hope our work can shed some light on designing transformers tailored for vision tasks. Code and models are available at https://github.com/google-research/deeplab2
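To make the clustering reformulation concrete, below is a minimal NumPy sketch of the core idea: the spatial-wise softmax of standard cross-attention is replaced by a cluster-wise hard assignment (argmax over queries), followed by a k-means-style center update. This is an illustration under assumed shapes and variable names, not the official implementation (which is in TensorFlow at the repository above); details such as learned projections, residual connections, and multi-head structure are omitted.

```python
# Minimal sketch of k-means cross-attention: each pixel is hard-assigned
# (argmax over queries) to one cluster center, and each center is updated
# as the mean of its assigned pixel features -- one k-means iteration.
# Shapes and names are illustrative assumptions, not from the official code.
import numpy as np

def kmeans_cross_attention(queries, pixel_features, eps=1e-6):
    """One k-means style cross-attention update.

    queries:        (num_queries, dim)  cluster centers / object queries
    pixel_features: (num_pixels, dim)   spatially flattened pixel features
    Returns updated queries of shape (num_queries, dim).
    """
    # Affinity between every cluster center and every pixel.
    logits = queries @ pixel_features.T            # (num_queries, num_pixels)

    # Assignment step: each pixel picks its highest-affinity cluster,
    # giving a hard one-hot assignment instead of a spatial-wise softmax.
    assign = np.zeros_like(logits)
    assign[np.argmax(logits, axis=0), np.arange(logits.shape[1])] = 1.0

    # Update step: each center becomes the mean of its assigned pixels.
    counts = assign.sum(axis=1, keepdims=True)     # pixels per cluster
    return (assign @ pixel_features) / (counts + eps)

# Toy usage: 8 object queries clustering 1024 flattened pixels of dim 32.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 32))
p = rng.standard_normal((1024, 32))
q = kmeans_cross_attention(q, p)
```

Because each pixel contributes to exactly one cluster, the update avoids the diluted attention weights that arise when softmax-normalizing over the extremely long flattened-pixel sequence, which is the difficulty the paper identifies with directly borrowing NLP-style cross-attention.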