Large Language Models (LLMs) such as LLaMA, Mistral, and Gemma are increasingly deployed in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, each addressing only isolated aspects of reliability. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baseline conditions, input perturbations, and post-hoc calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and shows that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.
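To make the idea concrete, the sketch below shows how a composite reliability score could combine accuracy, calibration, robustness, and uncertainty quality into one number. The component definitions and the weights are illustrative assumptions, not the paper's actual formula (which the abstract does not specify); only the Expected Calibration Error (ECE) follows its standard binned definition.

```python
# Hypothetical sketch of a composite reliability score in the spirit of CRS.
# The weights and the linear combination are illustrative assumptions; the
# paper's actual aggregation formula is not given in the abstract.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

def composite_reliability_score(accuracy, ece, robustness, uq_quality,
                                weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted combination of four components, each assumed to lie in [0, 1]:
    accuracy, calibration (as 1 - ECE), robustness under perturbation, and
    uncertainty-quantification quality (e.g. selective-prediction AUC)."""
    w_acc, w_cal, w_rob, w_uq = weights
    return (w_acc * accuracy
            + w_cal * (1.0 - ece)
            + w_rob * robustness
            + w_uq * uq_quality)
```

A model that is accurate but badly calibrated (high ECE) or brittle under perturbation is penalized relative to one with balanced components, which is the behavior the abstract attributes to CRS.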