AI systems have the potential to produce both benefits and harms, but without rigorous and ongoing adversarial evaluation, AI actors will struggle to assess the breadth and magnitude of the AI risk surface. Researchers from the field of systems design have developed several effective sociotechnical AI evaluation and red-teaming techniques targeting bias, hate speech, mis/disinformation, and other documented harm classes. However, as increasingly sophisticated AI systems are released into high-stakes sectors (such as education, healthcare, and intelligence-gathering), our current evaluation and monitoring methods are proving less and less capable of delivering effective oversight. New approaches to close this "responsibility gap" are therefore more urgent than ever if we are to deliver responsible AI, ensure its harms are fully understood, and mitigate its security vulnerabilities. In this paper, we propose one such approach, the cooperative public AI red-teaming exercise, and discuss early results from its pilot implementations. This approach is intertwined with CAMLIS itself: the first in-person public demonstrator exercise was held in conjunction with CAMLIS 2024. We review the operational design and results of this exercise, of a prior pilot exercise run under the National Institute of Standards and Technology (NIST) Assessing Risks and Impacts of AI (ARIA) program, and of a similar exercise conducted with Singapore's Infocomm Media Development Authority (IMDA). Ultimately, we argue that this approach both delivers meaningful results and scales across many AI-developing jurisdictions.