Existing research on generative AI security is largely driven by mutually reinforcing attack and defense methodologies grounded in empirical experience. This dynamic frequently gives rise to previously unknown attacks that circumvent current detection and prevention mechanisms, necessitating continual updates to security measures. Constructing generative AI with provable security and theoretically controllable risk is therefore necessary. Consensus Sampling (CS) is a promising algorithm toward provably secure AI: it controls risk by leveraging overlap in the output probabilities of multiple models. However, we find that CS relies on frequent abstention to avoid unsafe outputs, which reduces utility, and that it becomes highly vulnerable when unsafe models are maliciously manipulated. To address these issues, we propose a new primitive, Reliable Consensus Sampling (RCS), which traces acceptance probabilities to tolerate extreme adversarial behavior, improving robustness; RCS also eliminates the need for abstention entirely. We further develop a feedback algorithm that continuously and dynamically enhances the safety of RCS, and we provide theoretical guarantees that RCS maintains a controllable risk threshold. Extensive experiments show that RCS significantly improves both robustness and utility while maintaining latency comparable to CS. We hope this work contributes to the development of provably secure generative AI.