Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
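To make the described mechanism concrete, below is a minimal PyTorch sketch of test-time policy-gradient optimization of latent thought vectors guided by a confidence-based intrinsic reward, in the spirit of the abstract above. The toy `FrozenLM`, the Gaussian thought policy, the negative-entropy `confidence_reward`, and all hyperparameters are illustrative assumptions standing in for a real frozen LLM and the actual LTPO implementation, not the paper's code.

```python
# Minimal sketch: per-instance, test-time REINFORCE on latent "thought" vectors.
# Only the thought policy (mu, log_std) is updated; the LM stays frozen and is
# never back-propagated through, and no text is generated during optimization.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, NUM_THOUGHTS = 50, 32, 4


class FrozenLM(nn.Module):
    """Toy stand-in for a frozen LLM: maps a question embedding plus a
    sequence of latent thought vectors to answer logits over a vocabulary."""

    def __init__(self):
        super().__init__()
        self.mix = nn.Linear(DIM * (1 + NUM_THOUGHTS), DIM)
        self.head = nn.Linear(DIM, VOCAB)

    @torch.no_grad()  # frozen: never updated, never differentiated through
    def forward(self, question_emb, thoughts):
        h = torch.tanh(self.mix(torch.cat([question_emb, thoughts.flatten()])))
        return self.head(h)  # answer logits


def confidence_reward(logits):
    """Intrinsic reward from the model's own output distribution:
    negative entropy (higher = more peaked, more confident prediction)."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum()


def ltpo_step(lm, question_emb, mu, log_std, optimizer, num_samples=8):
    """One online policy-gradient (REINFORCE) step on the per-instance
    Gaussian thought policy N(mu, exp(log_std)^2)."""
    policy = torch.distributions.Normal(mu, log_std.exp())
    rewards, log_probs = [], []
    for _ in range(num_samples):
        thoughts = policy.sample()  # sampled latent thoughts (no reparam needed)
        rewards.append(confidence_reward(lm(question_emb, thoughts)))
        log_probs.append(policy.log_prob(thoughts).sum())
    rewards = torch.stack(rewards)
    baseline = rewards.mean()  # simple variance-reduction baseline
    loss = -torch.stack(log_probs) @ (rewards - baseline) / num_samples
    optimizer.zero_grad()
    loss.backward()  # gradients flow only into mu and log_std
    optimizer.step()
    return rewards.mean().item()


# Test-time optimization for a single problem instance.
lm = FrozenLM().eval()
question_emb = torch.randn(DIM)
mu = torch.zeros(NUM_THOUGHTS, DIM, requires_grad=True)
log_std = torch.full((NUM_THOUGHTS, DIM), -1.0, requires_grad=True)
optimizer = torch.optim.Adam([mu, log_std], lr=0.05)

for step in range(50):
    avg_conf = ltpo_step(lm, question_emb, mu, log_std, optimizer)
    if step % 10 == 0:
        print(f"step {step:2d}  mean confidence reward {avg_conf:.3f}")
```

Because the reward is a scalar read off the frozen model's output distribution and the update uses the score-function estimator, the sketch needs no labels, no external reward model, and no decoding loop, mirroring the parameter-free, supervision-free setting the abstract describes.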