Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieving higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results on the consistency of the bootstrapping. MoB serves as a simple yet strong alternative to BoN and self-consistency and, more broadly, motivates further research into more nuanced selection mechanisms.
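To make the selection mechanism concrete, the following is a minimal sketch of the idea described above: given N sampled answers and their reward-model scores, repeatedly resample the candidate pool with replacement, record the Best-of-N winner of each resample, and return the most frequent winner (the estimated mode of BoN's output distribution). The function name, argument names, and the number of bootstrap rounds are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from collections import Counter


def majority_of_the_bests(answers, rewards, n_bootstrap=1000, rng=None):
    """Illustrative sketch of Majority-of-the-Bests (MoB).

    answers:  list of final answers (e.g., strings) sampled from the LLM.
    rewards:  reward-model scores, one per answer (imperfect in practice).
    Returns the answer selected most often as the Best-of-N winner across
    bootstrap resamples, i.e., the estimated mode of BoN's output distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    answers = list(answers)
    rewards = np.asarray(rewards, dtype=float)
    n = len(answers)

    winners = []
    for _ in range(n_bootstrap):
        # Resample n candidates with replacement and record the BoN winner,
        # i.e., the resampled candidate with the highest reward.
        idx = rng.integers(0, n, size=n)
        best = idx[np.argmax(rewards[idx])]
        winners.append(answers[best])

    # The mode of the bootstrapped BoN outcomes is the MoB selection.
    return Counter(winners).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy usage: a single high-reward outlier ("41") would win plain BoN,
    # but the bootstrapped mode tends to favor the frequently winning "42".
    answers = ["42", "41", "42", "42", "17"]
    rewards = [0.91, 0.95, 0.88, 0.90, 0.60]
    print(majority_of_the_bests(answers, rewards))
```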