While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are very sensitive to the prompts that are fed into them. Even when prompts are semantically identical, language models may give very different answers. For safe and trustworthy deployment of PLMs, we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some work has looked into how state-of-the-art PLMs address this need, it has been limited to evaluating lexical equality of single- or multi-word answers and does not address the consistency of generated text sequences. To understand the consistency of PLMs in text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions in the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and that they also correlate with human evaluation of output consistency to a higher degree.
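The contrast between lexical equality and semantic agreement can be illustrated with a minimal sketch. It is not the metric proposed in this work: it assumes that consistency is scored as the fraction of output pairs that agree, and the names `pairwise_consistency`, `exact_match`, and `token_overlap` are hypothetical. In particular, `token_overlap` is only a crude stand-in for a real semantic-equivalence judgment, such as one made by a paraphrase or entailment model.

```python
from itertools import combinations
from typing import Callable, Sequence


def exact_match(a: str, b: str) -> bool:
    """Lexical agreement: identical after trimming whitespace and lowercasing."""
    return a.strip().lower() == b.strip().lower()


def token_overlap(a: str, b: str, threshold: float = 0.5) -> bool:
    """Placeholder for a semantic judge (assumption): Jaccard overlap of tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return True  # two empty answers trivially agree
    return len(ta & tb) / len(ta | tb) >= threshold


def pairwise_consistency(
    outputs: Sequence[str], agree: Callable[[str, str], bool]
) -> float:
    """Fraction of unordered output pairs that `agree` marks as equivalent."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return sum(agree(a, b) for a, b in pairs) / len(pairs)


# Hypothetical model outputs for three paraphrases of one question.
outputs = [
    "Nothing happens if you crack your knuckles",
    "nothing happens if you crack your knuckles often",
    "It causes arthritis",
]
lexical_score = pairwise_consistency(outputs, exact_match)
overlap_score = pairwise_consistency(outputs, token_overlap)
```

With these example outputs, exact matching treats every pair as inconsistent, while the looser agreement function credits the first two answers as equivalent. Replacing `token_overlap` with a learned judge would change only the `agree` argument.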