Building systems that are good for society in the face of complex societal effects requires a dynamic approach. Recent approaches to machine learning (ML) documentation have demonstrated the promise of discursive frameworks for deliberation about these complexities. However, these developments have been grounded in a static ML paradigm, leaving the role of feedback and post-deployment performance unexamined. Meanwhile, recent work in reinforcement learning has shown that the effects of feedback and optimization objectives on system behavior can be wide-ranging and unpredictable. In this paper, we sketch a framework for documenting deployed and iteratively updated learning systems, which we call Reward Reports. Taking inspiration from various contributions to the technical literature on reinforcement learning, we outline Reward Reports as living documents that track updates to the design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data. After presenting the elements of a Reward Report, we discuss a concrete example: Meta's BlenderBot 3 chatbot. Several other Reward Reports, covering game playing (DeepMind's MuZero), content recommendation (MovieLens), and traffic control (Project Flow), are included in the appendix.