Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
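To make the "transposed" attention concrete, the sketch below illustrates the core XCA computation described above: queries and keys are L2-normalised along the token dimension, their d-by-d cross-covariance matrix (rather than the N-by-N token-to-token matrix) is softmax-normalised, and the result mixes the channels of the values, giving cost linear in the number of tokens N. This is a minimal illustration, not the paper's full module; the tensor layout (batch, heads, channels_per_head, tokens) and a fixed scalar temperature are assumptions made here for brevity (in XCiT the temperature is a learnable per-head parameter, and the block also includes projections and a local patch-interaction layer).

```python
import torch
import torch.nn.functional as F


def xca_sketch(q, k, v, temperature=1.0):
    """Minimal sketch of cross-covariance attention (XCA).

    q, k, v: tensors of shape (batch, heads, channels_per_head, tokens).
    The attention map has shape (channels_per_head, channels_per_head),
    so the cost grows linearly with the number of tokens.
    """
    # L2-normalise queries and keys along the token dimension, so entries of
    # the cross-covariance matrix are cosine similarities between channels.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    # (d, N) @ (N, d) -> (d, d): channel-to-channel attention map.
    attn = (q @ k.transpose(-2, -1)) * temperature
    attn = attn.softmax(dim=-1)
    # Apply the map over channels of the values: (d, d) @ (d, N) -> (d, N).
    return attn @ v


if __name__ == "__main__":
    # Toy example: batch of 2, 4 heads, 16 channels per head, 196 tokens.
    q = torch.randn(2, 4, 16, 196)
    k = torch.randn(2, 4, 16, 196)
    v = torch.randn(2, 4, 16, 196)
    out = xca_sketch(q, k, v)
    print(out.shape)  # torch.Size([2, 4, 16, 196])
```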