Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme, combined with the data refinement process, that enables the model to learn nuanced, context-dependent safety preferences across different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection, a 13.5% improvement over the previous strongest multimodal safety defense method. The code will be made publicly available.