Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there has been significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, necessitating more research into identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture, inspired by the statistical language modeling literature: we augment the model with n-grams constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-Grammer, on language modeling on the C4 data-set as well as on text classification on the SuperGLUE data-set, and find that it outperforms several strong baselines such as the Transformer and the Primer. We open-source our model in Jax for reproducibility purposes.
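To make the core idea concrete, the following is a minimal sketch in jax.numpy of augmenting token embeddings with embeddings of latent bigrams: each position is discretized against a codebook, consecutive codes form bigram ids, and those ids are hashed into a fixed-size n-gram embedding table. All names, sizes, and the concatenation step are illustrative assumptions for this sketch, not the released N-Grammer implementation.

```python
# A hedged sketch of the latent n-gram augmentation idea, not the paper's exact code.
import jax
import jax.numpy as jnp

def latent_ngram_features(token_emb, cluster_centers, ngram_table):
    """token_emb: [seq_len, dim] token embeddings.
    cluster_centers: [num_clusters, dim] codebook used to discretize embeddings.
    ngram_table: [ngram_vocab_size, ngram_dim] embedding table for hashed bigram ids.
    Returns token embeddings concatenated with latent-bigram embeddings."""
    # 1) Discretize each position: nearest cluster center gives a latent code.
    dists = jnp.sum((token_emb[:, None, :] - cluster_centers[None, :, :]) ** 2, axis=-1)
    codes = jnp.argmin(dists, axis=-1)                      # [seq_len]

    # 2) Form bigram ids from consecutive latent codes (position 0 pairs with itself).
    prev = jnp.concatenate([codes[:1], codes[:-1]])
    num_clusters = cluster_centers.shape[0]
    bigram_ids = codes + prev * num_clusters                # [seq_len]

    # 3) Hash the bigram ids into a fixed-size n-gram vocabulary and look them up.
    hashed = bigram_ids % ngram_table.shape[0]
    ngram_emb = ngram_table[hashed]                         # [seq_len, ngram_dim]

    # 4) Combine with the original token embeddings (concatenation here; the paper
    #    may use a different combination).
    return jnp.concatenate([token_emb, ngram_emb], axis=-1)

# Toy usage with random parameters.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
token_emb = jax.random.normal(k1, (16, 64))
cluster_centers = jax.random.normal(k2, (256, 64))
ngram_table = jax.random.normal(k3, (4096, 32))
out = latent_ngram_features(token_emb, cluster_centers, ngram_table)
print(out.shape)  # (16, 96)
```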