When people reason about cause and effect, they often consider many competing "what if" scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual-agent debate framework in which one model produces a structured causal inference and the other critically examines that reasoning for logical flaws. When disagreements arise, the agents attempt to persuade each other, challenging each other's logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under-explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl's ladder of causation. With Qwen3 and DeepSeek-R1 as debater agents, we demonstrate that multi-agent debate improves DeepSeek-R1's overall accuracy in causal inference from 78.03% to 87.45%, with accuracy on the counterfactual category in particular improving from 67.94% to 80.04%. Similarly, Qwen3's overall accuracy improves from 84.16% to 89.41%, and its accuracy on counterfactual questions from 71.53% to 80.35%, showing that strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi-agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem-solving.
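To make the propose-critique-revise protocol concrete, the sketch below shows one way such a debate loop could be wired up in Python. It is a minimal illustration under assumptions rather than the paper's implementation: the `ask_model` callable, the role names, and the prompt wording are hypothetical placeholders for whatever backend serves the debater models (e.g., Qwen3 or DeepSeek-R1 endpoints).

```python
from typing import Callable

def debate(question: str,
           ask_model: Callable[[str, str], str],
           max_rounds: int = 3) -> str:
    """Run a two-agent debate until the agents agree or the round budget is spent.

    `ask_model(role, prompt)` is a hypothetical stand-in for any chat-completion
    backend; it returns the named agent's free-text response.
    """
    # Round 0: the proposer lays out a structured causal inference.
    proposal = ask_model(
        "proposer",
        f"Give a step-by-step causal analysis and a final yes/no answer.\n{question}",
    )
    for _ in range(max_rounds):
        # The critic inspects the proposer's reasoning for logical flaws.
        critique = ask_model(
            "critic",
            f"Question:\n{question}\n\nProposed reasoning:\n{proposal}\n"
            "If the reasoning and answer are sound, reply AGREE. "
            "Otherwise point out the flaw and give your own answer.",
        )
        if "AGREE" in critique.upper():
            break  # convergence: both agents accept the current answer
        # Otherwise the proposer revises its conclusion in light of the critique.
        proposal = ask_model(
            "proposer",
            f"Question:\n{question}\n\nYour previous reasoning:\n{proposal}\n"
            f"Critique:\n{critique}\n"
            "Revise your reasoning if the critique is valid, then restate "
            "your final yes/no answer.",
        )
    return proposal
```

In this sketch the critic's explicit AGREE token plays the role of the convergence check described above; how agreement is detected, how many rounds are allowed, and how the final answer is extracted are all design choices left open by the abstract.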