Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache introduces additional memory costs and prevents the use of larger batch sizes for higher throughput. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations of building multi-head keys and values, so no cache is needed for them. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared across heads. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without loss of accuracy.
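To illustrate the idea of expanding the query while sharing a single key and value, the minimal PyTorch sketch below compares standard per-head key/value projection against an EL-attention-style reordering in which the key and value projections are folded into the query and output sides, so only the shared hidden states need to be kept. All tensor names, shapes, and the per-head weight layout are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch

# Hypothetical dimensions for the sketch.
batch, src_len, d_model, n_heads = 2, 7, 16, 4
d_head = d_model // n_heads
scale = d_head ** -0.5

torch.manual_seed(0)
H = torch.randn(batch, src_len, d_model)   # shared hidden states (the only thing kept)
q = torch.randn(batch, 1, d_model)         # single decoding-step query

# Per-head projection matrices (illustrative initialization and layout).
W_q = torch.randn(n_heads, d_model, d_head) / d_model ** 0.5
W_k = torch.randn(n_heads, d_model, d_head) / d_model ** 0.5
W_v = torch.randn(n_heads, d_model, d_head) / d_model ** 0.5
W_o = torch.randn(n_heads, d_head, d_model) / d_model ** 0.5  # output projection, split per head

def standard_mha():
    """Standard multi-head attention: builds per-head keys and values from H."""
    outs = []
    for i in range(n_heads):
        Q = q @ W_q[i]                        # (batch, 1, d_head)
        K = H @ W_k[i]                        # (batch, src_len, d_head), normally cached
        V = H @ W_v[i]                        # (batch, src_len, d_head), normally cached
        attn = torch.softmax(Q @ K.transpose(-1, -2) * scale, dim=-1)
        outs.append(attn @ V @ W_o[i])        # (batch, 1, d_model)
    return sum(outs)                          # summing per-head slices = concat + full W_o

def el_attention_style():
    """Expanded query, shared key/value: H is used directly as key and value."""
    outs = []
    for i in range(n_heads):
        # Fold the key projection into the query: (q W_q)(H W_k)^T = (q W_q W_k^T) H^T
        q_exp = q @ W_q[i] @ W_k[i].transpose(-1, -2)      # (batch, 1, d_model)
        attn = torch.softmax(q_exp @ H.transpose(-1, -2) * scale, dim=-1)
        # Fold the value projection into the output: (attn H) W_v = attn (H W_v)
        outs.append(attn @ H @ W_v[i] @ W_o[i])            # (batch, 1, d_model)
    return sum(outs)

# The two orderings are mathematically identical, up to floating-point error.
print(torch.allclose(standard_mha(), el_attention_style(), atol=1e-4))  # True
```

The reordering exploits associativity of matrix multiplication: since per-head keys and values are linear projections of the same hidden states, those projections can be applied to the (much smaller) query and output instead, leaving a single shared key/value tensor for all heads.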