Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model targets of various shapes and orientations and to capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and incurs negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at \href{https://github.com/ViTAE-Transformer/QFormer}{QFormer}.
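To make the mechanism concrete, the sketch below illustrates the core idea in PyTorch: a lightweight head regresses per-window transformation parameters, a $3\times 3$ projective matrix is composed from them, and the default window grid is warped into a quadrangle before tokens are resampled for attention. This is a minimal sketch under stated assumptions, not the official implementation: the module name \texttt{QuadrangleWindowSampler}, the nine-parameter decomposition, and the residual composition of the matrix are illustrative choices; the linked repository contains the authors' code.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrangleWindowSampler(nn.Module):
    """Illustrative sketch (not the official code): regress per-window
    transformation parameters, warp the default window grid into a
    quadrangle, and resample tokens for attention."""
    def __init__(self, dim, window_size=7):
        super().__init__()
        self.window_size = window_size
        # Predicts 9 parameters per window: scale (2), shear (2),
        # rotation (1), translation (2), projection (2).
        self.transform_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, 9, kernel_size=1),
        )
        # Zero init: each quadrangle starts as its default window.
        nn.init.zeros_(self.transform_head[1].weight)
        nn.init.zeros_(self.transform_head[1].bias)
        # Default sampling grid for one window, in [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, window_size),
            torch.linspace(-1, 1, window_size),
            indexing="ij",
        )
        grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)
        self.register_buffer("base_grid", grid.reshape(-1, 3))  # (w*w, 3)

    def forward(self, win_feats):
        # win_feats: (num_windows, dim, w, w) features of default windows.
        n, c, w, _ = win_feats.shape
        t = self.transform_head(win_feats).reshape(n, 9)
        sx, sy, hx, hy, theta, tx, ty, px, py = t.unbind(dim=1)
        cos, sin = torch.cos(theta), torch.sin(theta)
        one = torch.ones_like(sx)
        # Residual composition of a 3x3 projective matrix: zero
        # predictions yield the identity (the default window).
        M = torch.stack([
            (1 + sx) * cos, -(1 + sy) * sin + hx, tx,
            (1 + sx) * sin + hy, (1 + sy) * cos, ty,
            px, py, one,
        ], dim=1).reshape(n, 3, 3)
        # Warp the default grid into a quadrangle sampling grid.
        pts = self.base_grid @ M.transpose(1, 2)           # (n, w*w, 3)
        pts = pts[..., :2] / pts[..., 2:].clamp(min=1e-6)  # perspective divide
        grid = pts.reshape(n, w, w, 2)
        # Resample tokens inside each learned quadrangle; attention is
        # then computed on the resampled tokens as in window attention.
        return F.grid_sample(win_feats, grid, align_corners=True)
\end{verbatim}

In this sketch, the zero-initialized head makes every quadrangle coincide with its default window at the start of training, so the model begins as plain window attention and learns to deform windows only where the data benefits; this residual design choice mirrors the "default windows to target quadrangles" formulation described above.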