Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computational complexity caused by each query attending to all keys/values, various methods constrain the range of attention to local regions, where each query attends only to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may attend to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model more relevant dependencies without the spatial constraints imposed by hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
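The core idea of grouping queries by content and restricting each group's attention to its most relevant keys/values can be sketched as follows. This is an illustrative NumPy sketch under simplifying assumptions, not the paper's exact method: the group assignment here uses hypothetical learned prototypes and argmax similarity, and each group scores keys with its mean query to pick a top-k subset.

```python
import numpy as np

def dg_attention(Q, K, V, prototypes, k):
    """Illustrative sketch of dynamic group attention (assumptions, not the paper's exact design).

    Q: (n, d) queries; K: (m, d) keys; V: (m, d) values;
    prototypes: (G, d) hypothetical learned group centroids; k: keys kept per group.
    """
    n, d = Q.shape
    out = np.zeros_like(Q)
    # 1. Content-dependent grouping: assign each query to its most similar prototype.
    assign = (Q @ prototypes.T).argmax(axis=1)          # (n,)
    for g in range(prototypes.shape[0]):
        idx = np.where(assign == g)[0]
        if idx.size == 0:
            continue
        # 2. Select the k most relevant keys for this group, scored by its mean query.
        group_q = Q[idx].mean(axis=0)                   # (d,)
        topk = np.argsort(group_q @ K.T)[-k:]           # indices of the k best keys
        # 3. Standard scaled softmax attention, restricted to the selected keys/values.
        logits = (Q[idx] @ K[topk].T) / np.sqrt(d)      # (|group|, k)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[idx] = w @ V[topk]
    return out
```

Unlike a fixed window partition, the key/value subset each query attends to here depends on the input content, so spatially distant but relevant tokens can still be attended to.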