Reinforcement learning (RL) in partially observable, fully cooperative multi-agent settings (Dec-POMDPs) can in principle be used to address many real-world challenges such as controlling a swarm of rescue robots or a synchronous team of quadcopters. However, Dec-POMDPs are significantly harder to solve than single-agent problems: the former are NEXP-complete, whereas MDPs are merely P-complete. As a result, current RL algorithms for Dec-POMDPs suffer from poor sample complexity, which limits their applicability to practical problems where environment interaction is costly. Our key insight is that, using just a polynomial number of samples, one can learn a centralized model that generalizes across different policies. We can then optimize the policy within the learned model instead of the true system, reducing the number of environment interactions. We further learn a centralized exploration policy within the learned model that is trained to collect additional data in state-action regions with high model uncertainty. Finally, we empirically evaluate the proposed model-based algorithm, MARCO, on three cooperative communication tasks, where it improves sample efficiency by up to 20x.
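The loop sketched below is a minimal, illustrative rendering of the three ingredients named above (learn a centralized model, optimize the policy inside it, and direct extra real-environment data collection toward high-uncertainty regions), not the authors' MARCO implementation. All names (`CentralizedModel`, `exploration_policy`, the toy dynamics, dimensions, and candidate policies) are hypothetical, and ensemble disagreement is used as a stand-in measure of model uncertainty.

```python
# Illustrative sketch only; names, dynamics, and policy search are hypothetical stubs.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, ENSEMBLE = 4, 2, 5

class CentralizedModel:
    """Ensemble of linear dynamics models fit on pooled joint experience."""
    def __init__(self):
        self.weights = [rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM))
                        for _ in range(ENSEMBLE)]

    def fit(self, states, actions, next_states):
        X = np.concatenate([states, actions], axis=1)
        for i in range(ENSEMBLE):
            # Bootstrap resampling so members disagree away from the data.
            idx = rng.integers(0, len(X), size=len(X))
            self.weights[i], *_ = np.linalg.lstsq(X[idx], next_states[idx], rcond=None)

    def predict(self, state, action):
        x = np.concatenate([state, action])
        preds = np.stack([x @ W for W in self.weights])
        return preds.mean(axis=0), preds.std(axis=0).mean()  # prediction, uncertainty

def true_step(state, action):
    """Stand-in for the real (costly) environment dynamics."""
    return 0.9 * state + 0.1 * np.pad(action, (0, STATE_DIM - ACTION_DIM))

def rollout(policy, step_fn, horizon=20):
    s, traj = np.zeros(STATE_DIM), []
    for _ in range(horizon):
        a = policy(s)
        s_next = step_fn(s, a)
        traj.append((s, a, s_next))
        s = s_next
    return traj

random_policy = lambda s: rng.normal(size=ACTION_DIM)

model = CentralizedModel()
data = rollout(random_policy, true_step, horizon=200)            # initial real data
for iteration in range(5):
    S, A, S2 = (np.stack(x) for x in zip(*data))
    model.fit(S, A, S2)                                          # (1) learn the model

    # (2) "Optimize" the task policy purely inside the model: here a stub that
    #     picks the better of two candidates by imagined return toward the origin.
    def imagined_return(pi):
        traj = rollout(pi, lambda s, a: model.predict(s, a)[0])
        return -sum(np.linalg.norm(s2) for _, _, s2 in traj)
    task_policy = max([random_policy, lambda s: -0.5 * s[:ACTION_DIM]],
                      key=imagined_return)

    # (3) Exploration policy: act where the ensemble disagrees most, then
    #     gather a small batch of additional *real* data there.
    def exploration_policy(s):
        cands = rng.normal(size=(8, ACTION_DIM))
        return max(cands, key=lambda a: model.predict(s, a)[1])
    data += rollout(exploration_policy, true_step, horizon=50)
```

In this toy loop the expensive quantity is the number of `true_step` calls; everything the task policy is trained on comes from model rollouts, and the only additional real interaction is the short uncertainty-seeking rollout per iteration, mirroring the sample-efficiency argument in the abstract.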