Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.
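To make the second idea concrete, here is a minimal, illustrative sketch (not the authors' implementation) of attention in which absolute position embeddings are added to the queries and keys rather than to the word embeddings. All names (PositionInfusedAttention, max_len, cache) are hypothetical, and causal masking is omitted for brevity; the point is that the cached token representations and the values stay position-free, so they can be reused across segments in recurrence without recomputing relative position terms.

```python
import torch
import torch.nn as nn


class PositionInfusedAttention(nn.Module):
    """Sketch of adding absolute position embeddings to queries and keys
    only (not to the word embeddings), so cached representations remain
    position-free and can be reused when conditioning on prior tokens."""

    def __init__(self, dim, num_heads, max_len=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fixed sinusoidal position table (assumed; any absolute embedding works).
        pos = torch.arange(max_len).unsqueeze(1)
        freqs = torch.exp(
            torch.arange(0, dim, 2) * (-torch.log(torch.tensor(10000.0)) / dim)
        )
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * freqs)
        pe[:, 1::2] = torch.cos(pos * freqs)
        self.register_buffer("pe", pe)

    def forward(self, x, cache=None):
        # x: (batch, cur_len, dim); cache: previously processed tokens
        # stored WITHOUT position information, shape (batch, cache_len, dim).
        mem = x if cache is None else torch.cat([cache, x], dim=1)
        total_len = mem.size(1)
        pos = self.pe[:total_len].unsqueeze(0)
        q = x + pos[:, total_len - x.size(1):]  # positions added to queries...
        k = mem + pos                           # ...and keys
        v = mem                                 # values stay position-free
        out, _ = self.attn(q, k, v, need_weights=False)
        return out, mem.detach()                # detached cache for the next segment
```

Because positions enter only through the queries and keys, extending the cache never requires re-embedding or re-encoding old tokens, which is where the efficiency gain over relative position embeddings comes from in this sketch.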