Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy captures only one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, since the majority of prediction attempts involve frequently occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice can lead to failure modes such as repetitive and dull text produced by downstream consumers of a language model. To address this, we propose two new intrinsic evaluation measures, framed within a simple word prediction task, that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
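The Zipfian confound described above can be made concrete with a frequency-stratified accuracy check. The sketch below is purely illustrative and is not one of the measures proposed here: the helper `stratified_accuracy`, its rank-based binning scheme, and the toy reference/prediction data are all assumptions introduced for demonstration. It shows how an aggregate token-level score can look strong even when accuracy on low-frequency words is poor.

```python
from collections import Counter

def stratified_accuracy(references, predictions, n_bins=3):
    """Token-level accuracy overall and per frequency band.

    Bands are assigned by the rank of each reference type in the reference
    corpus (band 0 = most frequent types), so the report exposes how much of
    the aggregate score is carried by high-frequency words.
    """
    assert len(references) == len(predictions)
    freqs = Counter(references)
    # Rank types by frequency and split the ranked list into n_bins bands.
    ranked = [t for t, _ in freqs.most_common()]
    band_of = {t: min(i * n_bins // len(ranked), n_bins - 1)
               for i, t in enumerate(ranked)}
    hits, totals = [0] * n_bins, [0] * n_bins
    for ref, pred in zip(references, predictions):
        b = band_of[ref]
        totals[b] += 1
        hits[b] += int(ref == pred)
    overall = sum(hits) / len(references)
    per_band = [h / t if t else float("nan") for h, t in zip(hits, totals)]
    return overall, per_band

# Toy usage: most tokens are the high-frequency type "the", which the
# hypothetical model always predicts correctly, while rarer words are
# mostly missed.
refs  = ["the"] * 8 + ["cat", "sat", "mat", "quixotic"]
preds = ["the"] * 8 + ["cat", "the", "the", "the"]
print(stratified_accuracy(refs, preds))
# Overall accuracy is 0.75, even though accuracy in the two
# lower-frequency bands is 0.0.
```

Stratifying by frequency band rather than reporting a single corpus-level number is one simple way to surface the high- versus low-frequency performance gap that aggregate accuracy and perplexity obscure.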