Reward Reports for Reinforcement Learning

The desire to build good systems in the face of complex societal effects requires a dynamic approach towards equity and access. Recent approaches to machine learning (ML) documentation have demonstrated the promise of discursive frameworks for deliberation about these complexities. However, these developments have been grounded in a static ML paradigm, leaving the role of feedback and post-deployment performance unexamined. Meanwhile, recent work in reinforcement learning design has shown that the effects of optimization objectives on the resultant system behavior can be wide-ranging and unpredictable. In this paper we sketch a framework for documenting deployed learning systems, which we call Reward Reports. Taking inspiration from various contributions to the technical literature on reinforcement learning, we outline Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data. After presenting the elements of a Reward Report, we provide three examples: DeepMind's MuZero, MovieLens, and a hypothetical deployment of a Project Flow traffic control policy.
