Today's probabilistic language generators often fall short of producing coherent and fluent text, despite the fact that the underlying models perform well under standard metrics, e.g., perplexity. This discrepancy has puzzled the language generation community for the last few years. In this work, we posit that the abstraction of natural language generation as a discrete stochastic process, which allows for an information-theoretic analysis, can provide new insights into the behavior of probabilistic language generators, e.g., why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, aiming to do so in a manner that is simultaneously efficient and error-minimizing; indeed, psycholinguistic research suggests that humans choose each word in a string with this subconscious goal in mind. We formally define the set of strings that meet this criterion: those for which each word has an information content close to the expected information content, i.e., the conditional entropy of our model. We then propose a simple and efficient procedure for enforcing this criterion when generating from probabilistic models, which we call locally typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, locally typical sampling offers competitive performance in terms of quality on both abstractive summarization and story generation, while consistently reducing degenerate repetitions.
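To make the decision rule concrete, the following is a minimal sketch of a locally-typical-style truncation step over a single next-token distribution, written in plain NumPy. The distribution `probs` and the mass threshold `tau` are illustrative placeholders, not values taken from the paper: tokens are ranked by how far their information content lies from the conditional entropy, the smallest such set reaching mass `tau` is kept, and the next token is sampled from the renormalized set.

```python
import numpy as np

def locally_typical_sample(probs, tau=0.95, rng=None):
    """Sketch: sample a token whose information content is close to the
    conditional entropy of the next-token distribution."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=float)

    # Information content of each candidate token, -log p(x).
    info = -np.log(probs + 1e-12)
    # Conditional entropy, i.e. the expected information content.
    entropy = np.sum(probs * info)

    # Rank tokens by distance of their information content from the entropy.
    order = np.argsort(np.abs(info - entropy))
    # Keep the smallest such set whose cumulative mass reaches tau.
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, tau) + 1
    kept = order[:cutoff]

    # Renormalize over the kept set and sample a token index.
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

# Toy usage with a hypothetical next-token distribution.
probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(locally_typical_sample(probs, tau=0.9))
```

In practice this step would be applied to the model's softmax output at every generation step, in the same place where nucleus or top-k truncation would otherwise be performed.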