Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms, like top-$p$ or top-$k$, address this by setting some words' probabilities to zero at each step. This work provides framing for the aim of truncation, and an improved algorithm for that aim. We propose thinking of a neural language model as a mixture of a true distribution and a smoothing distribution that avoids infinite perplexity. In this light, truncation algorithms aim to perform desmoothing, estimating a subset of the support of the true distribution. Finding a good subset is crucial: we show that top-$p$ unnecessarily truncates high-probability words, for example causing it to truncate all words but \emph{Trump} for a document that starts with \emph{Donald}. We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold. Compared to previous algorithms, $\eta$-sampling generates more plausible long English documents according to humans, is better at breaking out of repetition, and behaves more reasonably on a battery of test distributions.
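To make the idea of an entropy-dependent threshold concrete, here is a minimal Python sketch of such a truncation rule. The specific threshold form used, $\eta = \min(\epsilon, \sqrt{\epsilon}\,e^{-H})$ with a free hyperparameter $\epsilon$, and the function name are assumptions for illustration; the abstract only states that the threshold depends on the entropy of the next-word distribution, so the exact rule may differ from the paper's.

import numpy as np

def eta_truncate(probs, epsilon=0.01):
    # Illustrative entropy-dependent truncation (assumed form, not necessarily the paper's exact rule):
    #   eta = min(epsilon, sqrt(epsilon) * exp(-entropy))
    # Words with probability below eta are zeroed out and the rest renormalized.
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(probs + 1e-12))         # entropy of the next-word distribution
    eta = min(epsilon, np.sqrt(epsilon) * np.exp(-entropy))  # entropy-dependent probability cutoff
    kept = np.where(probs >= eta, probs, 0.0)                # estimated support of the true distribution
    return kept / kept.sum()

# A peaked (low-entropy) distribution loses its low-probability tail ...
print(eta_truncate([0.85, 0.10, 0.04, 0.007, 0.003]))
# ... while a flat (high-entropy) distribution keeps its full support,
# because the high entropy pushes the cutoff below every word's probability.
print(eta_truncate(np.full(200, 1 / 200)))

The second call illustrates the contrast with a fixed-threshold rule: a flat cutoff at $\epsilon$ would wrongly remove every word of the uniform distribution, whereas an entropy-dependent cutoff lowers the threshold when the distribution is flat.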