Multi-agent reinforcement learning has drawn increasing attention in practice, e.g., robotics and autonomous driving, as it can explore optimal policies using samples generated by interacting with the environment. However, high reward uncertainty remains a problem when training a satisfactory model, because obtaining high-quality reward feedback is usually expensive and even infeasible. To handle this issue, previous methods mainly focus on passive reward correction. At the same time, recent active reward estimation methods have proven to be a recipe for reducing the effect of reward uncertainty. In this paper, we propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL). Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training. Specifically, the multi-action-branch reward estimation models reward distributions on all action branches, and the policy-weighted reward aggregation then combines these estimates into stable updating signals during training. Our intuition is that considering all possible consequences of an action is useful for learning policies. The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios against SOTA baselines in terms of both effectiveness and robustness.
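As a rough illustration of the two ideas named in the abstract, the following is a minimal sketch of per-action-branch reward estimation followed by policy-weighted aggregation. The function names, shapes, and the fixed linear "estimator" are assumptions made for illustration only, not the authors' implementation; a real system would use learned networks and the agents' actual policies.

```python
import numpy as np

# Hypothetical sketch (not the DRE-MARL codebase): estimate a reward
# distribution for every action branch, then aggregate the branch means
# weighted by the current policy to get a lower-variance training signal.

rng = np.random.default_rng(0)
n_actions, obs_dim = 5, 8


def estimate_reward_distribution(obs, weights):
    """Predict a (mean, log_std) reward distribution for each action branch.

    A learned network would normally produce these; a fixed linear map
    stands in here so the example runs end to end.
    """
    logits = obs @ weights                 # shape: (n_actions * 2,)
    mean, log_std = np.split(logits, 2)    # one mean / log_std per branch
    return mean, log_std


def policy_weighted_aggregation(policy_probs, branch_means):
    """Aggregate per-branch reward estimates, weighted by the policy.

    Taking the expectation over all action branches smooths out the noise
    of any single sampled reward.
    """
    return float(policy_probs @ branch_means)


# Toy rollout for one agent.
obs = rng.normal(size=obs_dim)
weights = rng.normal(scale=0.1, size=(obs_dim, n_actions * 2))
policy_probs = np.full(n_actions, 1.0 / n_actions)  # uniform policy for illustration

branch_means, branch_log_stds = estimate_reward_distribution(obs, weights)
aggregated_reward = policy_weighted_aggregation(policy_probs, branch_means)

print("per-branch reward means:", branch_means.round(3))
print("per-branch reward stds: ", np.exp(branch_log_stds).round(3))
print(f"policy-weighted aggregated reward: {aggregated_reward:.3f}")
```

The aggregated value, rather than a single noisy observed reward, would then serve as the updating signal during training, which is the stabilization effect the abstract describes.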