Despite achieving incredibly low perplexities on myriad natural language corpora, today's language models still often underperform when used to generate text. This dichotomy has puzzled the language generation community for the last few years. In this work, we posit that the abstraction of natural language as a communication channel (à la Shannon, 1948) can provide new insights into the behaviors of probabilistic language generators, e.g., why high-probability texts can be dull or repetitive. Humans use language as a means of communicating information, and do so in an efficient yet error-minimizing manner, choosing each word in a string with this (perhaps subconscious) goal in mind. We propose that generation from probabilistic models should mimic this behavior. Rather than always choosing words from the high-probability region of the distribution--which have a low Shannon information content--we sample from the set of words with an information content close to its expected value, i.e., close to the conditional entropy of our model. This decision criterion can be realized through a simple and efficient implementation, which we call typical sampling. Automatic and human evaluations show that, in comparison to nucleus and top-k sampling, typical sampling offers competitive performance in terms of quality while consistently reducing the number of degenerate repetitions.
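The decision criterion above can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration rather than the paper's implementation; the function name typical_sampling_step and the mass parameter tau are assumptions made for the example, with tau playing a role analogous to p in nucleus sampling.

```python
import numpy as np

def typical_sampling_step(probs, tau=0.95, rng=None):
    """Sample one token whose information content is close to the entropy.

    probs: 1-D array of next-token probabilities from the model.
    tau:   probability mass to retain (hypothetical parameter name).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Shannon information content, -log p(x), of each candidate token.
    info = -np.log(probs + 1e-12)
    # Conditional entropy of the model's next-token distribution.
    entropy = np.sum(probs * info)
    # Rank tokens by how far their information content is from the entropy.
    order = np.argsort(np.abs(info - entropy))
    # Keep the smallest such set whose total probability mass reaches tau.
    cutoff = np.searchsorted(np.cumsum(probs[order]), tau) + 1
    kept = order[:cutoff]
    # Renormalize over the retained tokens and sample.
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())
```

Tokens are thus filtered by their distance from the conditional entropy rather than by raw probability, which is what distinguishes this criterion from nucleus or top-k truncation.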