This paper investigates model-based methods in multi-agent reinforcement learning (MARL). We specify the dynamics sample complexity and the opponent sample complexity in MARL, and conduct a theoretical analysis of an upper bound on the return discrepancy. To reduce this upper bound, and thus keep the sample complexity low throughout the learning process, we propose a novel decentralized model-based MARL method, named Adaptive Opponent-wise Rollout Policy Optimization (AORPO). In AORPO, each agent builds its own multi-agent environment model, consisting of a dynamics model and multiple opponent models, and trains its policy with the adaptive opponent-wise rollout. We further prove the theoretical convergence of AORPO under reasonable assumptions. Empirical experiments on competitive and cooperative tasks demonstrate that AORPO achieves improved sample efficiency with asymptotic performance comparable to that of the compared MARL methods.
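To make the adaptive opponent-wise rollout concrete, the following is a minimal, hypothetical Python sketch, not the paper's actual implementation. It assumes illustrative classes `DynamicsModel` and `OpponentModel`, and an assumed schedule in which opponents with larger estimated model error are simulated for shorter horizons within a rollout.

```python
# Hypothetical sketch of an adaptive opponent-wise rollout.
# All class names, the dummy transition/reward, and the horizon schedule
# 1 / (1 + error) are illustrative assumptions, not the paper's code.
import numpy as np

class DynamicsModel:
    """Placeholder learned dynamics model p_hat(s', r | s, joint action)."""
    def predict(self, state, joint_action):
        next_state = state + 0.01 * np.sum(joint_action, axis=0)  # dummy transition
        reward = -np.linalg.norm(next_state)                      # dummy reward
        return next_state, reward

class OpponentModel:
    """Placeholder model of one opponent's policy pi_hat^j(a^j | s)."""
    def __init__(self, action_dim, error):
        self.action_dim = action_dim
        self.error = error  # assumed estimate of this opponent model's error
    def sample(self, state):
        return np.random.uniform(-1.0, 1.0, self.action_dim)      # dummy action

def adaptive_rollout(policy, dynamics, opponent_models, start_state, max_len=10):
    """Generate one simulated trajectory in which each opponent model is only
    used up to a horizon that shrinks with its estimated error (assumed rule)."""
    horizons = [int(max_len / (1.0 + m.error)) for m in opponent_models]
    state, trajectory = start_state, []
    for t in range(max_len):
        own_action = policy(state)
        # Past its adapted horizon, an opponent's simulated action is replaced
        # by a fallback (zeros here, purely for brevity of the sketch).
        opp_actions = [m.sample(state) if t < h else np.zeros(m.action_dim)
                       for m, h in zip(opponent_models, horizons)]
        joint_action = np.stack([own_action] + opp_actions)
        state, reward = dynamics.predict(state, joint_action)
        trajectory.append((state, joint_action, reward))
    return trajectory

# Usage: one agent rolling out against two opponents whose models have
# different estimated errors, so they receive different simulated horizons.
policy = lambda s: -0.1 * s
traj = adaptive_rollout(policy, DynamicsModel(),
                        [OpponentModel(2, error=0.1), OpponentModel(2, error=0.8)],
                        start_state=np.zeros(2))
```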