Although Transformers with fully connected self-attention are powerful at modeling long-term dependencies, they struggle to scale to long texts with thousands of words in language modeling. One solution is to equip the model with a recurrence memory. However, existing approaches directly reuse hidden states from the previous segment, which encode the context in a uni-directional way. As a result, the memory cannot dynamically interact with the current context, which provides up-to-date information for token prediction. To remedy this issue, we propose Look-Ahead Memory (LaMemo), which enhances the recurrence memory by incrementally attending to right-side tokens and interpolating with the old memory states to maintain long-term information from the history. LaMemo embraces bi-directional attention and segment recurrence with an additional computation overhead that is only linearly proportional to the memory length. Experiments on widely used language modeling benchmarks demonstrate its superiority over baselines equipped with different types of memory.
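The following is a minimal sketch of the look-ahead memory idea described above, not the authors' implementation: old memory states attend to the tokens of the current segment (the "right side") and the result is interpolated with the previous memory to retain long-term history. Single-head dot-product attention is assumed, and the function name `look_ahead_memory_update`, the projections `w_q`/`w_k`/`w_v`, and the fixed scalar mixing weight `alpha` are illustrative placeholders standing in for whatever interpolation scheme the full model uses.

```python
# Minimal sketch (illustrative only) of a look-ahead memory refresh step.
import torch
import torch.nn.functional as F


def look_ahead_memory_update(memory, segment, w_q, w_k, w_v, alpha=0.5):
    """Let cached memory states attend to the current segment, then
    interpolate with the old memory to keep long-term information.

    memory:  (mem_len, d_model)  hidden states cached from earlier segments
    segment: (seg_len, d_model)  hidden states of the current segment
    alpha:   hypothetical scalar mixing weight in [0, 1]
    """
    q = memory @ w_q                       # queries come from the memory
    k = segment @ w_k                      # keys from the current segment
    v = segment @ w_v                      # values from the current segment
    scores = q @ k.T / k.shape[-1] ** 0.5  # (mem_len, seg_len): cost grows
    refreshed = F.softmax(scores, dim=-1) @ v  # linearly with memory length
    # Interpolate the look-ahead update with the old memory states.
    return alpha * refreshed + (1.0 - alpha) * memory


if __name__ == "__main__":
    d = 16
    w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
    memory = torch.randn(8, d)             # cached states of past tokens
    segment = torch.randn(4, d)            # states of the incoming segment
    new_memory = look_ahead_memory_update(memory, segment, w_q, w_k, w_v)
    print(new_memory.shape)                # torch.Size([8, 16])
```

Because the memory only attends to the current segment rather than to itself, the extra attention matrix has shape (mem_len, seg_len), which is where the overhead linear in the memory length comes from.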