Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on multiple generations obtained via multinomial sampling and measure their level of agreement. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked output distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also derive a theoretical lower bound on the probability mass of the beam set above which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
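To make the setup concrete, here is a minimal sketch of consistency-based UQ with beam-search candidates, assuming a Hugging Face causal LM. The model name, prompt format, and the specific agreement measure (entropy of the empirical answer distribution) are illustrative assumptions, not the paper's exact scoring function; the key point is that candidates come from deterministic beam search rather than multinomial sampling.

```python
# Sketch: generate candidate answers with beam search, then score
# uncertainty by how much the candidates disagree. Assumes a Hugging
# Face causal LM; model name and agreement measure are placeholders.
import math
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def beam_candidates(prompt: str, k: int = 5, max_new_tokens: int = 16):
    """Generate k candidates via beam search instead of multinomial sampling."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            num_beams=k,
            num_return_sequences=k,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # deterministic: no run-to-run variance
            early_stopping=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
        for seq in out
    ]

def consistency_uncertainty(answers):
    """Entropy of the empirical answer distribution: 0 when all agree."""
    counts = Counter(a.lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

answers = beam_candidates("Q: What is the capital of France?\nA:")
print(answers, consistency_uncertainty(answers))
```

Because beam search is deterministic, repeated runs on the same question produce the same candidate set and hence the same uncertainty estimate, which is the variance reduction the abstract refers to.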