Learning from Good Trajectories in Offline Multi-Agent Reinforcement Learning

Offline multi-agent reinforcement learning (MARL) aims to learn effective multi-agent policies from pre-collected datasets, an important step toward deploying multi-agent systems in real-world applications. In practice, however, the individual behavior policies that generate the joint multi-agent trajectories usually perform at different levels; for example, one agent may follow a random policy while the other agents follow medium-quality policies. In a cooperative game with a global reward, an agent trained by existing offline MARL methods often inherits this random policy, jeopardizing the performance of the entire team. In this paper, we investigate offline MARL with explicit consideration of the diversity of agent-wise trajectories and propose a novel framework called Shared Individual Trajectories (SIT) to address this problem. Specifically, an attention-based reward decomposition network assigns credit to each agent through a differentiable key-value memory mechanism in an offline manner. These decomposed credits are then used to reconstruct the joint offline datasets into prioritized experience replay with individual trajectories, after which agents can share their good trajectories and conservatively train their policies with a graph attention network (GAT) based critic. We evaluate our method on both discrete control (i.e., StarCraft II and the multi-agent particle environment) and continuous control (i.e., Multi-Agent MuJoCo). The results indicate that our method achieves significantly better results on complex and mixed offline multi-agent datasets, especially when the difference in data quality between individual trajectories is large.
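
The dataset reconstruction step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the per-agent credit weights, which the paper obtains from an attention-based decomposition network, are simply passed in as inputs here, and the names `IndividualTransition`, `split_joint_dataset`, and `sample_prioritized`, as well as the shifted-credit priority rule, are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class IndividualTransition:
    agent_id: int
    obs: tuple
    action: int
    credit: float  # this agent's share of the global reward (assumed given)


def split_joint_dataset(joint_transitions, credit_weights):
    """Split joint transitions into a shared pool of per-agent transitions.

    joint_transitions: list of (obs_per_agent, actions_per_agent, global_reward).
    credit_weights: one list of per-agent weights per transition, each summing
        to 1. In the paper these would come from a learned decomposition
        network; here they are hypothetical inputs.
    """
    pool = []
    for (obs, actions, reward), weights in zip(joint_transitions, credit_weights):
        for agent_id, (o, a, w) in enumerate(zip(obs, actions, weights)):
            pool.append(IndividualTransition(agent_id, o, a, w * reward))
    return pool


def sample_prioritized(pool, batch_size, rng, eps=1e-3):
    """Sample from the shared pool with probability increasing in credit.

    Priorities are credits shifted to be strictly positive. This rule is an
    assumption for the sketch, not the priority definition used in the paper.
    """
    lowest = min(t.credit for t in pool)
    priorities = [t.credit - lowest + eps for t in pool]
    return rng.choices(pool, weights=priorities, k=batch_size)


# Toy data: agent 0 behaves well, agent 1 behaves randomly.
joint = [(((0,), (1,)), (1, 0), 1.0) for _ in range(10)]
weights = [[0.9, 0.1] for _ in range(10)]
pool = split_joint_dataset(joint, weights)
batch = sample_prioritized(pool, batch_size=200, rng=random.Random(0))
```

Because every agent samples from the same pool, transitions credited to the stronger agent dominate each batch, which is the intuition behind sharing good individual trajectories across the team.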
