Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Building on prior progress on RFA, we characterize this gap through the lens of control variates and show that RFA can be decomposed as a sum of control variate estimators, one for each element in the sequence. This new framework reveals that exact softmax attention can be recovered from RFA by manipulating each control variate. Moreover, it allows us to develop a more flexible form of control variates, resulting in a novel attention mechanism that significantly reduces the approximation gap while maintaining linear complexity. Extensive experiments demonstrate that our model outperforms state-of-the-art efficient attention mechanisms on both vision and language tasks.
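To make the "approximation gap" concrete, below is a minimal sketch of vanilla RFA using Performer-style positive random features, compared against exact softmax attention. This is not the paper's proposed control-variate mechanism; the function names (`phi`, `rfa`) and the feature count `m=256` are illustrative choices. The point is that RFA touches each key/value pair once (linear in sequence length) while producing only an unbiased-in-expectation estimate of the softmax kernel, so its output differs from exact attention by a measurable gap.

```python
import numpy as np

def phi(x, W):
    # Positive random features for the softmax (exp) kernel, Performer-style:
    # with w_j ~ N(0, I_d), E_w[ phi(q)^T phi(k) ] = exp(q^T k).
    # x: (n, d), W: (m, d) -> features of shape (n, m)
    proj = x @ W.T
    return np.exp(proj - 0.5 * np.sum(x**2, -1, keepdims=True)) / np.sqrt(W.shape[0])

def rfa(Q, K, V, m=256, seed=0):
    # Random-feature attention: O(n) time/space in sequence length n.
    rng = np.random.default_rng(seed)
    d = Q.shape[-1]
    W = rng.standard_normal((m, d))
    # Split the 1/sqrt(d) softmax scaling evenly across queries and keys.
    Qf, Kf = phi(Q / d**0.25, W), phi(K / d**0.25, W)
    S = Kf.T @ V                          # (m, d_v): sum_i phi(k_i) v_i^T
    z = Kf.sum(0)                         # (m,):     sum_i phi(k_i)
    return (Qf @ S) / (Qf @ z)[:, None]   # self-normalized estimate

def softmax_attention(Q, K, V):
    # Exact softmax attention: O(n^2) time/space.
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (A @ V) / A.sum(-1, keepdims=True)

n, d = 128, 16
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(np.abs(rfa(Q, K, V) - softmax_attention(Q, K, V)).max())  # the approximation gap
```

Increasing `m` shrinks the gap but raises the cost per token; the abstract's claim is that reshaping the estimator with control variates reduces this gap without giving up the linear complexity shown above.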