Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent's visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding screenshot sequence, pinpointing failures in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.
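The judge-LLM protocol described above can be sketched as a trajectory-review loop. This is a minimal illustrative stand-in, not the benchmark's implementation: the `Step` fields, the `FAILURE_STAGES` taxonomy, and the rule-based `judge_trajectory` function are all hypothetical placeholders for a real judge-LLM call that would inspect screenshots and actions.

```python
# Hedged sketch of a judge-style failure-analysis protocol: a judge reviews an
# agent's action trajectory together with its screenshot sequence and attributes
# the failure to perception, recognition, or reasoning. The rule-based "judge"
# below is an illustrative stand-in for an actual judge-LLM.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    screenshot_id: str   # reference to the captured screen image
    ui_description: str  # what the agent reported perceiving on screen
    action: str          # the action the agent then executed

FAILURE_STAGES = ("perception", "recognition", "reasoning")

def judge_trajectory(steps: List[Step], injected_element: str) -> Optional[str]:
    """Toy judge: localize the failure stage against a known injected
    adversarial UI element (assumed available as ground truth)."""
    for step in steps:
        if injected_element not in step.ui_description:
            return "perception"   # agent never registered the injected overlay
        if "adversarial" not in step.ui_description:
            return "recognition"  # saw the element but did not flag it hostile
        if step.action.startswith("tap:") and injected_element in step.action:
            return "reasoning"    # flagged it as hostile, yet interacted anyway
    return None                   # no failure attributed to this trajectory

# Example trajectory: the agent describes the overlay but not as adversarial.
trace = [Step("s1", "home screen with fake_update overlay", "tap:fake_update")]
print(judge_trajectory(trace, "fake_update"))  # → "recognition"
```

In a full pipeline, the rule checks would be replaced by a judge-LLM prompt containing the screenshot sequence and action log, with the model asked to emit one of the three stage labels.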