Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what information is appropriate to share while carrying out a given task -- becomes a central question for the field. We posit that CI demands a form of reasoning in which the agent reasons about the context it is operating in. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only $\sim700$ examples, but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates the privacy leakage of AI assistants in actions and tool calls. Our code is available at: https://github.com/EricGLan/CI-RL

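To make the trade-off between task performance and inappropriate disclosure concrete, the following is a minimal sketch of a rule-based reward that an RL setup of this kind could use. It is an illustration only, not the authors' implementation (see the linked repository for that): the `Example` class, the `ci_reward` function, the per-example labels, and the substring-matching rule are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One hypothetical synthetic example: a task plus labeled information items."""
    task: str
    appropriate: list[str]    # items the task needs (assumed labels)
    inappropriate: list[str]  # items that should not be disclosed in this context


def ci_reward(response: str, example: Example) -> float:
    """Illustrative reward in [-1, 1] (an assumption, not the paper's reward).

    Adds the fraction of appropriate items present (task utility) and
    subtracts the fraction of inappropriate items present (leakage).
    Case-insensitive substring matching is a simplification for this sketch.
    """
    text = response.lower()

    def fraction_present(items: list[str]) -> float:
        if not items:
            return 0.0
        return sum(item.lower() in text for item in items) / len(items)

    return fraction_present(example.appropriate) - fraction_present(example.inappropriate)


# Hypothetical usage: the same task, answered with and without a leak.
example = Example(
    task="Book a dentist appointment for the user",
    appropriate=["Jane Doe", "Tuesday 3pm"],
    inappropriate=["HIV status"],
)
good = "Please book Jane Doe for Tuesday 3pm."
leaky = "Please book Jane Doe for Tuesday 3pm. Note that her HIV status is positive."

good_score = ci_reward(good, example)    # all needed items present, nothing leaked
leaky_score = ci_reward(leaky, example)  # needed items present, but one item leaked
```

Under these assumptions, a response that completes the task without leaking scores higher than one that completes it while disclosing an out-of-context item, which is the behavior the abstract describes the training as encouraging.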