In this paper we present a new architecture, the Pattern Attention Transformer (PAT), built on the new doughnut kernel. Whereas tokens in NLP are naturally discrete, Transformers in computer vision must cope with the high resolution of pixels in images. Inheriting the patch/window idea from ViT and its follow-ups, the doughnut kernel enhances the design of patches: it replaces the hard line-cut boundaries with two types of areas, sensor and updating, based on our comprehension of self-attention (named the QKVA grid). The doughnut kernel also raises a new question about the shape of kernels. To verify its performance on image classification, PAT is designed with Transformer blocks of regular-octagon-shaped doughnut kernels. Its performance on ImageNet-1K surpasses the Swin Transformer (+0.7 top-1 accuracy).
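Since the abstract only sketches the mechanism, the snippet below is a minimal, hedged illustration of how a sensor/updating split could be realized with plain self-attention: tokens in an inner "updating" area emit queries and are refreshed, while a surrounding "sensor" area supplies keys and values. The square (rather than octagonal) areas, single attention head, and all function and variable names (`doughnut_attention`, `sensor_idx`, `update_idx`) are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (not the PAT implementation) of a sensor/updating attention split.
import torch
import torch.nn.functional as F

def doughnut_attention(x, sensor_idx, update_idx, wq, wk, wv):
    """
    x          : (N, C) flattened tokens of one feature map
    sensor_idx : (S,) indices of the sensor area (provides keys/values)
    update_idx : (U,) indices of the updating area (provides queries, gets rewritten)
    wq, wk, wv : (C, C) projection weights
    """
    q = x[update_idx] @ wq                                     # (U, C) queries
    k = x[sensor_idx] @ wk                                     # (S, C) keys
    v = x[sensor_idx] @ wv                                     # (S, C) values
    attn = F.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)   # (U, S)
    out = x.clone()
    out[update_idx] = attn @ v                                 # only the updating area changes
    return out

# Toy usage: 7x7 feature map, 3x3 updating area centred in a 5x5 sensor area.
if __name__ == "__main__":
    H = W = 7; C = 16
    x = torch.randn(H * W, C)
    coords = torch.arange(H * W).reshape(H, W)
    sensor_idx = coords[1:6, 1:6].reshape(-1)    # 5x5 sensor area
    update_idx = coords[2:5, 2:5].reshape(-1)    # 3x3 updating area inside it
    wq, wk, wv = (torch.randn(C, C) * 0.02 for _ in range(3))
    y = doughnut_attention(x, sensor_idx, update_idx, wq, wk, wv)
    print(y.shape)  # torch.Size([49, 16])
```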