This paper proposes a definition of system health in the context of multiple agents optimizing a joint reward function. We use this definition as a credit assignment term in a policy gradient algorithm to distinguish the contributions of individual agents to the global reward. The health-informed credit assignment is then extended to a multi-agent variant of the proximal policy optimization algorithm and demonstrated on particle and multiwalker robot environments that have characteristics such as system health, risk-taking, semi-expendable agents, continuous action spaces, and partial observability. We show significant improvement in learning performance compared to policy gradient methods that do not perform multi-agent credit assignment.