In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a language generation task as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques -- including a tool that assists human adversaries -- to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our simple "avoid injuries" task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. With our chosen thresholds, filtering with our baseline classifier decreases the rate of unsafe completions from about 2.4% to 0.003% on in-distribution data, which is near the limit of our ability to measure. We found that adversarial training significantly increased robustness to the adversarial attacks that we trained on, without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.
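As a rough illustration of the filtering setup described above, the following minimal sketch rejects any completion whose classifier score reaches a conservative threshold, then works out the size of the reported reduction. The names `filter_completions` and `toy_score_unsafe`, the keyword list, and the threshold value `0.01` are hypothetical and chosen only for illustration; they are not the classifier, thresholds, or data used in this work.

```python
from typing import Callable, Iterable, List


def filter_completions(
    completions: Iterable[str],
    score_unsafe: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only completions whose unsafe score is strictly below the threshold."""
    return [c for c in completions if score_unsafe(c) < threshold]


def toy_score_unsafe(text: str) -> float:
    """Hypothetical stand-in classifier: flags a few injury-related words."""
    injury_words = {"stabbed", "wounded", "bleeding"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 0.9 if words & injury_words else 0.001


candidates = [
    "He waved and walked home.",
    "The knight was stabbed in the arm.",
    "She opened the window.",
]

# A low threshold is "conservative": anything even mildly suspicious is dropped.
# The value 0.01 is an assumption for this sketch only.
kept = filter_completions(candidates, toy_score_unsafe, threshold=0.01)

# Rates reported in the abstract, written as fractions.
unfiltered_rate = 0.024    # about 2.4% unsafe completions without filtering
filtered_rate = 0.00003    # 0.003% unsafe completions with the baseline classifier
reduction_factor = unfiltered_rate / filtered_rate  # roughly 800x fewer failures

print(kept)
print(round(reduction_factor))
```

Read this way, the reported rates correspond to roughly an 800-fold drop in unsafe completions on in-distribution data. Lowering the threshold further trades more rejected completions for fewer missed failures.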