Structured deliberation has been found to improve the performance of human forecasters. This study investigates whether a similar intervention, in which large language models (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro) review each other's forecasts before updating their own, can improve forecasting accuracy. Using 202 resolved binary questions from the Metaculus Q2 2025 AI Forecasting Tournament, we assessed accuracy across four scenarios: (1) diverse models with distributed information, (2) diverse models with shared information, (3) homogeneous models with distributed information, and (4) homogeneous models with shared information. The intervention significantly improved accuracy in scenario (2), reducing log loss by 0.020, a relative improvement of about 4 percent (p = 0.017). However, when homogeneous groups (three instances of the same model) engaged in the same process, no benefit was observed. Unexpectedly, providing LLMs with additional contextual information did not improve forecast accuracy, limiting our ability to study information pooling as a mechanism. Our findings suggest that deliberation may be a viable strategy for improving LLM forecasting.
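For reference, the accuracy metric used throughout is the standard binary log loss (mean negative log-likelihood of the resolved outcomes under the forecast probabilities). Below is a minimal sketch of how it is computed; the function name and the example probabilities are illustrative, not taken from the study.

```python
import math

def binary_log_loss(probabilities, outcomes):
    """Mean negative log-likelihood of binary outcomes under forecast probabilities."""
    eps = 1e-15  # clip probabilities to avoid log(0) on overconfident forecasts
    total = 0.0
    for p, y in zip(probabilities, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probabilities)

# Hypothetical example: forecasts on three resolved questions.
# A reduction of 0.020 in this quantity corresponds to the
# improvement reported for scenario (2) above.
print(f"log loss: {binary_log_loss([0.70, 0.40, 0.90], [1, 0, 1]):.3f}")
```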