Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt a model's out-of-domain performance. In this paper, we argue that language models should instead be evaluated on their marginal likelihood over tokenisations. We compare different sampling-based estimators for the marginal likelihood, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one-best perplexity, especially on out-of-domain data. We link this difference in perplexity to tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.
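To make the sampling-based estimation concrete, the following is a minimal Python sketch of an importance-sampling estimator for the marginal log-likelihood, assuming N tokenisations s_i of a text x have been sampled from some proposal tokeniser q(s | x) (e.g. a stochastic subword sampler) and each joint score log p(x, s_i) has been obtained from the language model. The function name and the illustrative numbers are ours, not the paper's implementation.

```python
import math

def log_marginal_estimate(log_p_joint, log_q):
    """Importance-sampling estimate of log p(x) = log sum_s p(x, s).

    log_p_joint[i]: log p(x, s_i) from the language model for the
                    i-th sampled tokenisation s_i.
    log_q[i]:       log q(s_i | x) under the proposal tokeniser the
                    samples were drawn from.

    Returns logsumexp_i(log_p_joint[i] - log_q[i]) - log N. The
    corresponding estimator of p(x) is unbiased in probability space;
    by Jensen's inequality, its log is a lower bound in expectation.
    """
    assert len(log_p_joint) == len(log_q) > 0
    ratios = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    m = max(ratios)  # shift by the max to stabilise the exponentials
    return m + math.log(sum(math.exp(r - m) for r in ratios)) - math.log(len(ratios))

# Example: three sampled tokenisations of the same sentence
# (illustrative log-probabilities, not real model outputs).
log_p = [-23.1, -25.4, -24.0]   # log p(x, s_i) from the LM
log_q = [-1.2, -2.9, -2.1]      # log q(s_i | x) from the sampler
print(log_marginal_estimate(log_p, log_q))
```

Because sampled tokenisations differ in length, converting such an estimate into a perplexity requires normalising by a tokenisation-independent unit, such as the number of words or characters in x, rather than by the number of sub-word tokens.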