Recent progress in language modeling has been driven not only by advances in neural architectures, but also by improvements in hardware and optimization. In this paper, we revisit the neural probabilistic language model (NPLM) of~\citet{Bengio2003ANP}, which simply concatenates word embeddings within a fixed window and passes the result through a feed-forward network to predict the next word. When scaled up to modern hardware, this model (despite its many limitations) performs much better than expected on word-level language model benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a baseline Transformer with short input contexts, but struggles to handle long-term dependencies. Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer, which results in small but consistent perplexity decreases across three word-level language modeling datasets.
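For concreteness, the sketch below shows one way the NPLM-style local concatenation described above could be realized in PyTorch: embeddings of the last few tokens are concatenated and fed through a feed-forward network that produces next-word logits. The class name, layer sizes, window length, and activation are illustrative assumptions, not the paper's configuration.

\begin{verbatim}
import torch
import torch.nn as nn

class ConcatWindowLM(nn.Module):
    """Minimal NPLM-style sketch (illustrative sizes, not the paper's):
    concatenate the embeddings of the last `window` tokens and map them
    through a feed-forward network to next-word logits."""

    def __init__(self, vocab_size=10000, d_embed=128, window=5, d_hidden=512):
        super().__init__()
        self.window = window
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.ff = nn.Sequential(
            nn.Linear(window * d_embed, d_hidden),
            nn.Tanh(),
            nn.Linear(d_hidden, vocab_size),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); each position predicts the next token
        x = self.embed(token_ids)                      # (batch, seq_len, d_embed)
        # left-pad so every position sees a full window of local context
        pad = x.new_zeros(x.size(0), self.window - 1, x.size(2))
        x = torch.cat([pad, x], dim=1)
        # slide a window over the sequence and concatenate along features
        windows = x.unfold(1, self.window, 1)          # (batch, seq_len, d_embed, window)
        windows = windows.transpose(2, 3).flatten(2)   # (batch, seq_len, window*d_embed)
        return self.ff(windows)                        # next-token logits
\end{verbatim}

The same concatenation operation, placed in front of the remaining self-attention layers, is the kind of local first layer the modified Transformer described above would use; here it is shown standalone purely for illustration.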