The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural-language-based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy uses procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59\% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized task completion over the constraint in up to 86\% of cases. In high-stakes situations pitting privacy against critical social norms, leading models such as GPT-4o and Claude-3.5-haiku disregarded the social norm over 15\% of the time. These findings underscore a fundamental misalignment in current LLMs regarding physically grounded privacy and establish the need for more robust, physically aware alignment. Code and datasets will be available at https://github.com/Graph-COM/EAPrivacy.