The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art on the WikiText-103 benchmark, even though it features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
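As a rough illustration of the gated convolutional idea described above, the sketch below shows a causal (left-padded) 1-D convolution followed by a simplified gating of the form h = (X∗W + b) ⊗ σ(X∗V + c). This is a minimal sketch, not the paper's exact architecture; the class name, kernel size, and channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Causal 1-D convolution with a gated linear unit:
    h = (X * W + b) * sigmoid(X * V + c).
    Left-padding by (kernel_size - 1) keeps the receptive field finite and causal.
    (Illustrative sketch; not the authors' reference implementation.)"""

    def __init__(self, channels: int, kernel_size: int = 4):
        super().__init__()
        self.pad = kernel_size - 1
        # A single convolution produces both the linear path and the gate path.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        x = nn.functional.pad(x, (self.pad, 0))   # pad on the left only (causal)
        a, b = self.conv(x).chunk(2, dim=1)       # split into linear / gate halves
        return a * torch.sigmoid(b)               # gated linear unit

# Toy usage: stack a few blocks over already-embedded tokens.
if __name__ == "__main__":
    x = torch.randn(8, 128, 32)                   # (batch, channels, tokens)
    model = nn.Sequential(*[GatedConvBlock(128) for _ in range(4)])
    print(model(x).shape)                         # torch.Size([8, 128, 32])
```

Because each block sees only a fixed window of previous tokens, all positions in a sequence can be processed in parallel, in contrast to the step-by-step computation of a recurrent model.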