Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets covering synthetic and real-world decision-making and tool-use tasks, and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
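To make the PolicyRollout stage more concrete, below is a minimal sketch of one plausible reading of "augmenting rollouts with policy-aware responses" in a GRPO-style group. It assumes that a few of the sampled rollouts per group see the policy text in context while the rest see only the bare prompt, and that rewards are normalized group-relatively; the function names (`policy_rollout_group`, `generate`, `reward_fn`), the mixing ratio, and the normalization details are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a PolicyRollout-style rollout group (assumptions
# noted in the text above): a GRPO-style group in which a few rollouts are
# sampled with the policy text in context, and advantages are normalized
# over the whole group.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    response: str
    reward: float
    policy_in_context: bool  # True if the policy text was visible when sampling
    advantage: float = 0.0

def policy_rollout_group(
    prompt: str,
    policy_text: str,
    generate: Callable[[str], str],          # hypothetical: sample one response
    reward_fn: Callable[[str, str], float],  # hypothetical: score vs. the policy
    group_size: int = 8,
    num_policy_aware: int = 2,
) -> List[Rollout]:
    """Build one rollout group: plain rollouts plus policy-aware rollouts."""
    rollouts = []
    for i in range(group_size):
        aware = i < num_policy_aware
        # Policy-aware rollouts see the full policy, grounding exploration;
        # the remaining rollouts see only the bare user prompt.
        conditioned = (policy_text + "\n\n" + prompt) if aware else prompt
        response = generate(conditioned)
        # Reward is always judged against the policy, whether or not the
        # model saw the policy while sampling.
        rollouts.append(Rollout(response, reward_fn(response, policy_text), aware))

    # GRPO-style group-relative advantage: normalize rewards within the group,
    # so successful policy-aware rollouts pull the policy-free rollouts
    # toward compliant behavior.
    rewards = [r.reward for r in rollouts]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    for r in rollouts:
        r.advantage = (r.reward - mean) / std
    return rollouts
```

Under this reading, the policy-aware rollouts act as grounded, typically higher-reward trajectories within each group, so the group-relative advantages steer the policy-free samples toward the same behavior, which is what allows the policy text to be dropped at inference time.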