Observability into the decision-making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning (RL) optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability, even at the current frontier scale. Notably, in place of a model at low reasoning effort, we could deploy a smaller model at higher reasoning effort, matching capabilities while obtaining higher monitorability, albeit at a higher overall inference-compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor's test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to the agent's CoT not only improves monitorability but also steepens the scaling trend between the monitor's test-time compute and monitorability. Finally, we show that we can improve monitorability by asking models follow-up questions and providing their follow-up CoT to the monitor.
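As a rough, hypothetical illustration of how such an evaluation could be operationalized (the paper's actual metric and harness are not specified here; the `Trajectory` fields, the `monitor` callable, and the threshold rule below are all assumptions), the following sketch treats a monitor as a function that assigns a suspicion score to a trajectory and summarizes monitorability as the monitor's true-positive rate at a fixed false-positive rate. The `use_cot` flag mirrors the comparison between CoT monitoring and action-only monitoring described above.

```python
# Minimal sketch, not the paper's implementation: "monitorability" here is the
# monitor's true-positive rate at a fixed false-positive rate. The Trajectory
# fields and the `monitor` callable are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Trajectory:
    cot: str          # agent's chain-of-thought
    action: str       # agent's final action/output
    misbehaved: bool  # ground-truth label from the evaluation harness

def monitorability(
    trajectories: Sequence[Trajectory],
    monitor: Callable[[str], float],  # text -> suspicion score
    use_cot: bool,
    max_fpr: float = 0.01,
) -> float:
    """Return TPR at the threshold whose FPR stays at or below max_fpr."""
    def monitor_input(t: Trajectory) -> str:
        # Action-only monitoring hides the CoT from the monitor.
        return f"{t.cot}\n{t.action}" if use_cot else t.action

    scored = [(monitor(monitor_input(t)), t.misbehaved) for t in trajectories]
    benign = sorted(score for score, bad in scored if not bad)
    misbehaving = [score for score, bad in scored if bad]
    if not benign or not misbehaving:
        return float("nan")
    # Pick the cutoff so at most max_fpr of benign trajectories exceed it.
    cutoff = benign[min(int(len(benign) * (1 - max_fpr)), len(benign) - 1)]
    return sum(score > cutoff for score in misbehaving) / len(misbehaving)
```

Under this toy definition, comparing `monitorability(trajs, monitor, use_cot=True)` against `use_cot=False` reproduces the CoT-versus-action-only comparison, and sweeping the monitor's test-time compute would trace out the scaling trends the abstract reports.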