Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context with varying scale. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
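To make the pooling-before-attention idea concrete, below is a minimal sketch of token-adaptive context pooling, assuming a PyTorch setting. The module name `ContextPool`, the linear weight/size predictors, and the Gaussian support mask are illustrative assumptions for exposition, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextPool(nn.Module):
    """Sketch: pool neighboring features per token with learned
    weights and a learned (per-token) support size, before attention."""

    def __init__(self, dim):
        super().__init__()
        # Predict a scalar pooling weight and a scalar support size per token
        # (assumed parameterization; the paper may use small conv heads instead).
        self.weight_head = nn.Linear(dim, 1)
        self.size_head = nn.Linear(dim, 1)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        B, N, D = x.shape
        w = self.weight_head(x).squeeze(-1)            # (B, N) unnormalized weights
        s = F.softplus(self.size_head(x)).squeeze(-1)  # (B, N) positive support sizes

        pos = torch.arange(N, device=x.device, dtype=x.dtype)
        # Gaussian support mask centered at each token i with per-token width s_i:
        # mask[b, i, j] ∝ exp(-(j - i)^2 / (2 * s[b, i]^2))
        dist2 = (pos[None, :, None] - pos[None, None, :]) ** 2   # (1, N, N)
        mask = torch.exp(-dist2 / (2 * s[:, :, None] ** 2 + 1e-6))

        # Combine the support mask with token-wise pooling weights and
        # normalize over neighbors j, giving an adaptive local average.
        attn = mask * w[:, None, :].sigmoid()
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        # Pooled features encode context at a token-adaptive scale; these
        # would replace x as input to a standard self-attention layer.
        return torch.einsum('bij,bjd->bid', attn, x)
```

Because both the weights and the Gaussian width are predicted from the features themselves, each token effectively chooses how much surrounding context to summarize, which is the adaptive granularity the abstract describes.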