Designing reinforcement learning (RL) agents is typically a difficult process that requires numerous design iterations. Learning can fail for a multitude of reasons, and standard RL methods offer too few tools to pinpoint the exact cause. In this paper, we show how to integrate value decomposition into a broad class of actor-critic algorithms and use it to assist in the iterative agent-design process. Value decomposition separates a reward function into distinct components and learns value estimates for each. These value estimates provide insight into an agent's learning and decision-making process and enable new training methods to mitigate common problems. As a demonstration, we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. SAC-D maintains similar performance to SAC, while learning a larger set of value predictions. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making. Using these tools, we provide several demonstrations of decomposition's use in identifying and addressing problems in the design of both environments and agents. Value decomposition is broadly applicable and easy to incorporate into existing algorithms and workflows, making it a powerful tool in an RL practitioner's toolbox.
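The core idea can be illustrated with a minimal sketch. The example below uses simple tabular Q-learning rather than the paper's SAC-D actor-critic formulation, and all names (`decomposed_update`, `component_shares`) are hypothetical: each reward component gets its own value table, the policy acts on their sum, and the per-component values can then be inspected as a rough influence-style diagnostic (not the paper's exact reward influence metric).

```python
import numpy as np

# Hypothetical tabular sketch of value decomposition (not the paper's SAC-D).
# Each reward component i gets its own Q_i table; the agent acts on their
# sum, so sum_i Q_i plays the role of the standard Q for the composite
# reward r = sum_i r_i.

n_states, n_actions, n_components = 4, 2, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_components, n_states, n_actions))

def decomposed_update(s, a, r_components, s_next, done):
    """One TD(0) update per reward component, all sharing the greedy
    next action chosen from the summed (composite) Q-values."""
    a_next = np.argmax(Q.sum(axis=0)[s_next])  # act on the total value
    for i in range(n_components):
        target = r_components[i] + (0.0 if done else gamma * Q[i, s_next, a_next])
        Q[i, s, a] += alpha * (target - Q[i, s, a])

def component_shares(s, a):
    """Each component's share of the value magnitude at (s, a) -- a simple
    diagnostic in the spirit of, but not identical to, reward influence."""
    vals = Q[:, s, a]
    total = np.abs(vals).sum()
    return vals / total if total > 0 else vals
```

Because the per-component tables are learned alongside the composite value rather than in place of it, this kind of decomposition can be bolted onto an existing training loop with little change to the agent's behavior, which is the property the paper exploits for SAC-D.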