While large pretrained language models (PLMs) demonstrate remarkable fluency and strong performance on many natural language tasks, recent work has shown that well-performing PLMs are highly sensitive to the prompts that are fed into them: even when prompts are semantically equivalent, language models may give very different answers. For safe and trustworthy deployment of PLMs, we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some prior work has examined how state-of-the-art PLMs address this need, it has been limited to evaluating lexical equality of single- or multi-word answers and does not address the consistency of generated text sequences. To understand the consistency of PLMs in text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions from the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and also correlate more strongly with human evaluations of output consistency.
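The abstract does not specify how the semantic consistency measure is computed. As a rough illustration only, one plausible instantiation scores the answers a PLM produces for paraphrases of the same question by their mean pairwise embedding similarity; the encoder choice (`all-MiniLM-L6-v2`), the `semantic_consistency` helper, and the example answers below are assumptions for the sketch, not the paper's actual metric.

```python
# Hedged sketch: pairwise semantic-consistency score for a set of answers that
# a PLM generated for paraphrased versions of one question. The embedding
# model and helper names are illustrative assumptions, not the paper's method.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder


def semantic_consistency(answers: list[str]) -> float:
    """Mean pairwise cosine similarity between embeddings of the answers."""
    embeddings = embedder.encode(answers)
    sims = cosine_similarity(embeddings)
    pairs = list(combinations(range(len(answers)), 2))
    return float(np.mean([sims[i, j] for i, j in pairs]))


# Usage: hypothetical answers to three paraphrases of the same question.
answers = [
    "No, the Great Wall of China is not visible from space with the naked eye.",
    "You cannot see the Great Wall from orbit without magnification.",
    "Astronauts report the Wall is not visible to the unaided eye from space.",
]
print(f"semantic consistency: {semantic_consistency(answers):.3f}")
```

A lexical metric such as exact-match or token overlap would rate these three answers as inconsistent despite their shared meaning, which is the gap the proposed semantic measure is meant to close.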