In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a safe language generation task (``avoid injuries'') as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques -- including a tool that assists human adversaries -- to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.
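As an illustration of the filtering setup described above, the following is a minimal sketch of a classifier screening a generator's completions under a conservative threshold. It is not the implementation used in this work: the names `filter_completions` and `toy_injury_score`, the keyword-based scorer, and the threshold value of 0.01 are all hypothetical stand-ins chosen for illustration.

```python
import re
from typing import Callable, Iterable, Optional


def filter_completions(
    prompt: str,
    candidates: Iterable[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.01,  # illustrative value, not taken from the paper
) -> Optional[str]:
    """Return the first candidate whose estimated injury probability is
    below `threshold`, or None if every candidate is rejected."""
    for completion in candidates:
        if score_fn(prompt, completion) < threshold:
            return completion
    return None


def toy_injury_score(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for a learned classifier: a keyword match
    that returns a high score when an injury-related word appears."""
    injury_words = {"stabbed", "wound", "broke", "bleed"}
    words = set(re.findall(r"[a-z]+", completion.lower()))
    return 0.9 if words & injury_words else 0.001


candidates = ["He was stabbed in the arm.", "She smiled and waved."]
accepted = filter_completions("They met at the park.", candidates, toy_injury_score)
print(accepted)
```

A lower threshold rejects more completions, trading some output quality for a smaller chance of accepting an injurious one; when no candidate passes, the sketch returns `None` rather than falling back to an unfiltered completion.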