As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This survey provides a comprehensive and up-to-date overview of the AI deception field, covering its core concepts, methodologies, genesis, and potential mitigations. First, we propose a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We then review existing empirical studies and associated risks, highlighting deception as a sociotechnical safety challenge. We organize the landscape of AI deception research as a deception cycle consisting of two key components: deception emergence and deception treatment. Deception emergence reveals the mechanisms underlying AI deception: systems with sufficient capability and incentive potential inevitably engage in deceptive behaviors when triggered by external conditions. Deception treatment, in turn, focuses on detecting and addressing such behaviors. On deception emergence, we analyze incentive foundations across three hierarchical levels and identify three essential capability preconditions required for deception. We further examine contextual triggers, including supervision gaps, distributional shifts, and environmental pressures. On deception treatment, we summarize detection methods, covering benchmarks and evaluation protocols in both static and interactive settings. Building on the three core factors of deception emergence, we outline potential mitigation strategies and propose auditing approaches that integrate technical, community, and governance efforts to address sociotechnical challenges and future AI risks. To support ongoing work in this area, we release a living resource at www.deceptionsurvey.com.