The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from pre-collected datasets. However, it remains an open question how to solve offline RL in the more practical multi-agent setting, as many real-world scenarios involve interaction among multiple agents. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer directly. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, performance degrades significantly as the number of agents increases. Towards mitigating this degradation, we identify a key issue: the landscape of the value function can be non-concave, and policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem, since a suboptimal policy by any single agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which tackles this critical challenge by effectively combining first-order policy gradients with zeroth-order optimization methods so that the actor can better optimize the conservative value function. Despite its simplicity, OMAR significantly outperforms strong baselines and achieves state-of-the-art performance on multi-agent continuous control benchmarks.
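To illustrate the core intuition behind actor rectification, the following is a minimal toy sketch, not the paper's implementation: a 1-D action landscape with a spurious local optimum stands in for a learned conservative critic, and a zeroth-order step samples candidate actions around the actor's output and keeps the one with the highest value. The function `q_value`, the sampling width `sigma`, and the number of candidates are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(action):
    # Toy stand-in for a learned conservative critic: a broad peak at
    # a = 0.8 plus a narrow spurious bump near a = -0.5, mimicking a
    # non-concave value landscape where gradient ascent can get stuck.
    return -((action - 0.8) ** 2) + 0.5 * np.exp(-50.0 * (action + 0.5) ** 2)

def rectify_action(actor_action, num_samples=64, sigma=0.3):
    """Zeroth-order rectification step (illustrative): sample candidate
    actions around the actor's proposal and return the best under Q."""
    candidates = actor_action + sigma * rng.standard_normal(num_samples)
    candidates = np.clip(candidates, -1.0, 1.0)  # respect action bounds
    values = q_value(candidates)
    return candidates[np.argmax(values)]

# An actor stuck near the spurious local optimum at a = -0.5 gets
# rectified toward a higher-value action; the actor would then be
# regressed toward this rectified target alongside the usual
# first-order policy gradient update.
stuck = -0.5
better = rectify_action(stuck)
assert q_value(better) > q_value(stuck)
```

In this sketch the rectified action serves only as a regression target for the actor; the paper's method combines such zeroth-order proposals with the ordinary first-order policy gradient rather than replacing it.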