Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers can capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models employ self-attention locally on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow a fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving comparable accuracy to state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while being up to 5x faster than existing methods. The code is publicly available at \url{https://github.com/moabarar/qna}.
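To make the core idea concrete, the following is a minimal, illustrative sketch of local attention with learned, input-independent queries applied over overlapping (stride-1) windows, in the spirit of QnA. It is not the authors' implementation (see the repository above for that); the class name, shapes, and hyperparameters are assumptions chosen for clarity, and relative positional embeddings and multiple queries per head are omitted.

\begin{verbatim}
# Hypothetical sketch of QnA-style local attention with learned queries.
# Assumptions: PyTorch, (B, H, W, C) feature maps, odd window size,
# a single learned query per head, no positional embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryLocalAttention(nn.Module):
    def __init__(self, dim, window=3, heads=4):
        super().__init__()
        assert dim % heads == 0 and window % 2 == 1
        self.dim, self.window, self.heads = dim, window, heads
        self.head_dim = dim // heads
        # Queries are learned parameters shared by all windows,
        # rather than being computed per pixel from the input.
        self.query = nn.Parameter(torch.randn(heads, self.head_dim))
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map
        B, H, W, C = x.shape
        k, v = self.to_kv(x).chunk(2, dim=-1)          # (B, H, W, C) each

        # Since the query is input-independent, q.k^T reduces to a
        # per-pixel dot product computed once, before window extraction.
        k = k.view(B, H, W, self.heads, self.head_dim)
        logits = torch.einsum('bhwnd,nd->bhwn', k, self.query)
        logits = logits / self.head_dim ** 0.5          # (B, H, W, heads)

        # Extract overlapping windows (stride 1, zero padding), like a conv.
        pad = self.window // 2
        def unfold(t, ch):
            t = t.permute(0, 3, 1, 2)                   # (B, ch, H, W)
            t = F.unfold(t, self.window, padding=pad)   # (B, ch*k*k, H*W)
            return t.view(B, ch, self.window ** 2, H * W)

        logits_w = unfold(logits, self.heads)           # (B, heads, k*k, HW)
        v_w = unfold(v, C).view(B, self.heads, self.head_dim,
                                self.window ** 2, H * W)

        attn = logits_w.softmax(dim=2)                  # softmax over each window
        out = torch.einsum('bnkp,bndkp->bndp', attn, v_w)
        out = out.reshape(B, C, H, W).permute(0, 2, 3, 1)
        return self.proj(out)
\end{verbatim}

Because the learned queries are shared across windows, the attention logits are computed as a single per-pixel projection of the keys, and only the softmax and value aggregation act on the overlapping windows; this is what gives the layer its convolution-like, shift-invariant behavior and its favorable memory scaling with window size.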