Reinforcement learning algorithms require a large number of samples, which often limits their real-world applicability even on simple tasks. This challenge is even more pronounced in multi-agent tasks, where each interaction step is more costly because it requires communication or the shifting of resources. This work aims to improve the data efficiency of multi-agent control through model-based learning. We consider networked systems in which agents are cooperative and communicate only locally with their neighbors, and propose the Decentralized Model-based Policy Optimization (DMPO) framework. In our method, each agent learns a dynamics model to predict future states and broadcasts its predictions to its neighbors via communication; the policies are then trained on model-generated rollouts. To alleviate the bias of model-generated data, we restrict model usage to short (myopic) rollouts, thereby reducing the compounding error of model generation. To preserve the independence of policy updates, we introduce an extended value function and theoretically prove that the resulting policy gradient closely approximates the true policy gradient. We evaluate our algorithm on several benchmarks for intelligent transportation systems, namely connected autonomous vehicle control tasks (Flow and CACC) and adaptive traffic signal control (ATSC). Empirical results show that our method achieves superior data efficiency and matches the performance of model-free methods that use the true model.
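To make the rollout procedure concrete, the following is a minimal Python sketch, not the authors' implementation, of the short ("myopic") decentralized model rollouts described above. The names `LocalModel`, `LocalPolicy`, `neighbors`, and `horizon`, as well as the toy dynamics and control law, are hypothetical placeholders for the learned components.

```python
# Minimal sketch of decentralized short-horizon model rollouts (assumptions:
# LocalModel, LocalPolicy, neighbors, and the toy dynamics are placeholders,
# not the paper's implementation).
import numpy as np


class LocalModel:
    """One agent's learned dynamics model: predicts its next local state
    from its own state/action and its neighbors' broadcast states."""

    def predict(self, state, action, neighbor_states):
        # Stand-in for a learned predictor, e.g. a small neural network.
        return state + 0.1 * action + 0.01 * np.mean(neighbor_states, axis=0)


class LocalPolicy:
    """Decentralized policy conditioned only on local information."""

    def act(self, state, neighbor_states):
        return -0.5 * state  # placeholder control law


def generate_model_rollout(states, models, policies, neighbors, horizon=5):
    """Roll the learned models out for a short horizon only, limiting the
    compounding error of model-generated data."""
    trajectory = [states]
    for _ in range(horizon):
        actions = [
            policies[i].act(states[i], [states[j] for j in neighbors[i]])
            for i in range(len(states))
        ]
        # Each agent predicts its own next state and (conceptually)
        # broadcasts the prediction to its neighbors.
        states = [
            models[i].predict(states[i], actions[i],
                              [states[j] for j in neighbors[i]])
            for i in range(len(states))
        ]
        trajectory.append(states)
    return trajectory
```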