Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length. Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.
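The trade-off described above can be made concrete with a toy sketch. It assumes fixed-length binary strings, a made-up local conditional (`ar_next_prob`), and a made-up energy function (`energy`); none of these come from the paper, and the example does not demonstrate any hardness result. It only contrasts how each model family arrives at a next-symbol probability: the autoregressive model reads it off a cheap local factor, while the energy-based model scores whole strings cheaply but, in general, must sum over all completions to obtain a conditional.

```python
import itertools
import math

ALPHABET = (0, 1)  # assumed toy alphabet


def ar_next_prob(prefix, symbol):
    """Hypothetical locally normalized conditional p(symbol | prefix)."""
    if not prefix:
        return 0.5
    # Illustrative rule: prefer repeating the previous symbol.
    return 0.8 if symbol == prefix[-1] else 0.2


def ar_score(x):
    """Autoregressive scoring: a product of cheap local factors."""
    return math.prod(ar_next_prob(x[:t], x[t]) for t in range(len(x)))


def energy(x):
    """Hypothetical energy: number of adjacent disagreements."""
    return sum(a != b for a, b in zip(x, x[1:]))


def ebm_unnormalized(x):
    """Energy-based scoring: cheap, but only up to a normalizing constant."""
    return math.exp(-energy(x))


def ebm_next_prob(prefix, symbol, n):
    """Next-symbol probability under the EBM for strings of length n.

    Brute force: sums over every completion of the prefix, so the cost
    grows exponentially in the number of remaining positions.
    """
    def mass(p):
        rest = n - len(p)
        return sum(
            ebm_unnormalized(p + tail)
            for tail in itertools.product(ALPHABET, repeat=rest)
        )

    return mass(prefix + (symbol,)) / mass(prefix)
```

For this particular energy a dynamic program would avoid the exponential sum; the brute-force loop stands in for the generic case, where no such shortcut is assumed to be available. Calling `ar_score((0, 0, 1))` multiplies three local factors, whereas `ebm_next_prob((0,), 1, 10)` enumerates every length-10 string that extends the prefix.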