While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, i.e., an error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking: choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores give users the flexibility to adaptively choose tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for two correctness types: mathematical factuality and property constraint satisfaction.
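As a minimal sketch of the thresholding mechanism the abstract alludes to: an e-score is an e-value, so by Markov's inequality the probability that it exceeds 1/α is at most α. A response set can therefore be formed by keeping responses whose e-score stays below 1/α, which caps the corresponding error probability at α. The function name, the specific e-score values, and the inclusion convention below are illustrative assumptions, not the paper's actual implementation.

```python
def conformal_set(responses, e_scores, alpha):
    """Keep responses whose e-score (evidence of incorrectness) falls
    below the 1/alpha threshold. By Markov's inequality, an e-value
    exceeds 1/alpha with probability at most alpha, which yields the
    tolerance-level guarantee described in the abstract."""
    threshold = 1.0 / alpha
    return [r for r, e in zip(responses, e_scores) if e < threshold]

# Hypothetical e-scores for three candidate LLM responses.
responses = ["A", "B", "C"]
e_scores = [0.5, 12.0, 3.0]
print(conformal_set(responses, e_scores, alpha=0.1))  # threshold 1/0.1 = 10
```

Because the Markov bound holds simultaneously for every α, a user may inspect the e-scores first and then pick α, which is the post-hoc flexibility (bounded size distortion) that p-value-based sets lack.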