A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages

Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques -- i.e., truncation and error injection -- to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
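The abstract names truncation and error injection as the two perturbations used to probe faithfulness, but it does not say how they are implemented. The following is a minimal Python sketch under stated assumptions: a thinking trace is represented as a list of step strings, and error injection shifts the last integer in one chosen step. The function names `truncate_trace`, `inject_error`, and `answer_change_rate` are hypothetical and are not taken from the authors' released code.

```python
import re


def truncate_trace(steps, keep_fraction):
    """Keep only the leading portion of a thinking trace.

    Assumption: the trace is already split into a list of step strings.
    The number of kept steps is floored.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    keep = int(len(steps) * keep_fraction)
    return steps[:keep]


def inject_error(steps, step_index, offset=1):
    """Return a copy of the trace with one numeric error injected.

    Assumption: an "error" is modeled as shifting the last integer
    found in the chosen step by `offset`. Steps without digits are
    left unchanged.
    """
    target = steps[step_index]
    matches = list(re.finditer(r"\d+", target))
    if not matches:
        return list(steps)
    last = matches[-1]
    corrupted = (
        target[: last.start()]
        + str(int(last.group()) + offset)
        + target[last.end():]
    )
    perturbed = list(steps)
    perturbed[step_index] = corrupted
    return perturbed


def answer_change_rate(original_answers, perturbed_answers):
    """Fraction of examples whose final answer differs after perturbation."""
    pairs = list(zip(original_answers, perturbed_answers))
    if not pairs:
        return 0.0
    changed = sum(1 for before, after in pairs if before != after)
    return changed / len(pairs)


if __name__ == "__main__":
    trace = ["3 + 4 = 7", "7 * 2 = 14", "So the answer is 14"]
    print(truncate_trace(trace, 0.5))
    print(inject_error(trace, 1))
    print(answer_change_rate(["14", "9"], ["15", "9"]))
```

In a setup like this, the perturbed trace would be fed back to the model and the final answers compared. One plausible reading is that a higher answer-change rate indicates the model relies more heavily on its trace, though the exact metric used in the paper may differ.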