Deceptive UI designs, widely instantiated across the web and commonly known as dark patterns, manipulate users into performing actions misaligned with their goals. In this paper, we show that dark patterns are highly effective in steering agent trajectories, posing a significant risk to agent robustness. To quantify this risk, we introduce DECEPTICON, an environment for testing individual dark patterns in isolation. DECEPTICON includes 700 web navigation tasks with dark patterns -- 600 generated tasks and 100 real-world tasks -- designed to measure instruction-following success and dark pattern effectiveness. Across state-of-the-art agents, we find that dark patterns successfully steer agent trajectories toward malicious outcomes in over 70% of tested tasks, both generated and real-world, compared to a human average of 31%. Moreover, we find that dark pattern effectiveness correlates positively with model size and test-time reasoning, making larger, more capable models more susceptible. Leading countermeasures against adversarial attacks, including in-context prompting and guardrail models, fail to consistently reduce the success rate of dark pattern interventions. Our findings reveal dark patterns as a latent and unmitigated risk to web agents, highlighting the urgent need for robust defenses against manipulative designs.