Towards Provably Secure Generative AI: Reliable Consensus Sampling

Existing research on generative AI security is primarily driven by mutually reinforcing attack and defense methodologies grounded in empirical experience. This dynamic frequently gives rise to previously unknown attacks that circumvent current detection and prevention mechanisms, so security mechanisms must be updated continually. Constructing generative AI with provable security and theoretically controllable risk is therefore necessary. Consensus Sampling (CS) is a promising algorithm toward provably secure AI: it controls risk by leveraging the overlap in model output probabilities. However, we find that CS relies on frequent abstention to avoid unsafe outputs, which reduces utility. Moreover, CS becomes highly vulnerable when unsafe models are maliciously manipulated. To address these issues, we propose a new primitive called Reliable Consensus Sampling (RCS), which traces acceptance probability to tolerate extreme adversarial behaviors and thereby improves robustness. RCS also eliminates the need for abstention entirely. We further develop a feedback algorithm that continuously and dynamically enhances the safety of RCS. We provide theoretical guarantees that RCS maintains a controllable risk threshold. Extensive experiments show that RCS significantly improves robustness and utility while maintaining latency comparable to CS. We hope this work contributes to the development of provably secure generative AI.
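
To make the idea of "overlap in model output probabilities" concrete, here is a minimal toy sketch of overlap-based acceptance with abstention over small discrete distributions. It is an illustration under stated assumptions, not the CS or RCS algorithm from the paper: the acceptance rule (mean of the `s` lowest model probabilities divided by the mean over all models), the function names `overlap_acceptance` and `consensus_sample`, and the parameters `s` and `rounds` are all hypothetical choices made for this example.

```python
import random

def overlap_acceptance(dists, y, s):
    """Toy acceptance probability for output y (assumed rule, not the paper's).

    Ratio of the mean of the s lowest model probabilities of y to the mean
    over all models. It is 1 when all models agree on y and 0 when at least
    s models give y zero probability.
    """
    probs = sorted(d.get(y, 0.0) for d in dists)
    mean_all = sum(probs) / len(probs)
    mean_low = sum(probs[:s]) / s
    return mean_low / mean_all

def consensus_sample(dists, s, rounds, rng):
    """Propose from a random model, accept by overlap, abstain after `rounds`."""
    for _ in range(rounds):
        proposer = rng.choice(dists)           # pick one model uniformly
        outputs = list(proposer)
        weights = [proposer[o] for o in outputs]
        y = rng.choices(outputs, weights=weights, k=1)[0]
        if rng.random() < overlap_acceptance(dists, y, s):
            return y                           # accepted output
    return None                                # abstain: no output accepted

if __name__ == "__main__":
    rng = random.Random(0)
    agreeing = [{"a": 0.5, "b": 0.5}] * 3      # full overlap: always accepts
    disjoint = [{"a": 1.0}, {"b": 1.0}]        # no overlap: always abstains
    print(consensus_sample(agreeing, 2, 1, rng))
    print(consensus_sample(disjoint, 1, 20, rng))
```

In this toy setting, models with disjoint support never produce an accepted output when `s = 1`, which illustrates the abstract's concern that abstention-based safety can cost utility. How RCS avoids abstention is not specified in the abstract and is not modeled here.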

