Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, which makes their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches to confidence estimation for LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). We evaluate these approaches on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best overall reliability, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
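As a rough illustration of the kinds of signals these methods rely on, the sketch below computes a sequence-probability score from token log-probabilities, a sample-consistency score from repeated generations, and a toy hybrid of the two. The function names, the exact-match agreement criterion, and the `alpha` mixing weight are illustrative assumptions, not the formulations evaluated in this work.

```python
import math
from collections import Counter
from typing import List


def msp_confidence(token_logprobs: List[float]) -> float:
    """Sequence-probability-style confidence (MSP-like, assumed form):
    exponentiate the sum of per-token log-probabilities of the answer."""
    return math.exp(sum(token_logprobs))


def sample_consistency(sampled_answers: List[str]) -> float:
    """Consistency-style confidence: the fraction of sampled answers that
    agree with the majority answer. Exact string match is used here as a
    stand-in for the semantic-equivalence check a real system would need."""
    counts = Counter(a.strip().lower() for a in sampled_answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(sampled_answers)


def hybrid_confidence(token_logprobs: List[float],
                      sampled_answers: List[str],
                      alpha: float = 0.5) -> float:
    """Toy hybrid score: a convex combination of the model-probability signal
    and the consistency signal. This is only a schematic stand-in for a
    CoCoA-style combination, not the method of Vashurin et al. (2025)."""
    return (alpha * msp_confidence(token_logprobs)
            + (1 - alpha) * sample_consistency(sampled_answers))


if __name__ == "__main__":
    logprobs = [-0.05, -0.10, -0.02]  # hypothetical per-token log-probs of the greedy answer
    samples = ["Paris", "Paris", "paris", "Lyon", "Paris"]  # answers from repeated sampling
    print(f"Sequence-probability confidence: {msp_confidence(logprobs):.3f}")
    print(f"Sample consistency:              {sample_consistency(samples):.3f}")
    print(f"Toy hybrid score:                {hybrid_confidence(logprobs, samples):.3f}")
```

The example is self-contained (no model calls); in practice the log-probabilities and sampled answers would come from the LLM being evaluated, and the hybrid weighting would follow the CoCoA formulation rather than this fixed `alpha`.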