Large language models can now generate intermediate reasoning steps before producing answers, improving performance on difficult problems by iteratively developing solutions. This study uses a content moderation task to examine parallels between human decision times and model reasoning effort, measured by the length of the chain-of-thought (CoT). Across three frontier models, CoT length consistently predicts human decision time. Moreover, humans took longer and models produced longer CoTs when important variables were held constant, suggesting similar sensitivity to task difficulty. Analyses of CoT content show that models reference various contextual factors more frequently when making such decisions. These findings reveal parallels between human and AI reasoning on practical tasks and underscore the potential of reasoning traces for enhancing interpretability and decision-making.