Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
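
The merging step described above can be illustrated with a minimal sketch. It assumes each of the K traces has already produced a next-token logit vector over a shared vocabulary at a synchronization point; the helper names (`average_logits`, `top_p_filter`, `merged_next_token`) and the toy numbers are hypothetical and are not taken from the ThinkMerge implementation. How synchronization points are chosen and how the traces are hosted in vLLM/SGLang is not covered here.

```python
import math
from typing import List, Sequence


def average_logits(per_trace_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of K logit vectors over a shared vocabulary."""
    if not per_trace_logits:
        raise ValueError("need at least one trace")
    vocab_size = len(per_trace_logits[0])
    if any(len(row) != vocab_size for row in per_trace_logits):
        raise ValueError("all traces must share one vocabulary size")
    k = len(per_trace_logits)
    return [sum(row[i] for row in per_trace_logits) / k for i in range(vocab_size)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: Sequence[float], top_p: float) -> List[float]:
    """Keep the smallest set of tokens whose cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def merged_next_token(per_trace_logits: Sequence[Sequence[float]], top_p: float = 1.0) -> int:
    """Average logits across traces, apply Top-p, and pick the most likely token (greedy)."""
    probs = top_p_filter(softmax(average_logits(per_trace_logits)), top_p)
    return max(range(len(probs)), key=lambda i: probs[i])


# Toy example: K = 3 traces, vocabulary of 3 tokens (made-up numbers).
trace_logits = [
    [2.0, 1.0, 0.0],  # trace 1 prefers token 0
    [0.0, 3.0, 0.0],  # trace 2 strongly prefers token 1
    [2.0, 1.5, 0.0],  # trace 3 mildly prefers token 0
]
merged = average_logits(trace_logits)
token = merged_next_token(trace_logits, top_p=0.9)
print(merged, token)
```

In this toy case two of the three traces individually prefer token 0, yet the averaged logits select token 1, because the merge weighs how strongly each trace prefers a token and does not simply count votes. Since the averaged logits form an ordinary logit vector, standard filters such as Top-p can be applied to them unchanged, which is the sense in which the sketch stays compatible with standard decoding.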