Due to information asymmetry, finding optimal policies for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) is hard, with the complexity growing doubly exponentially in the horizon length. The challenge increases greatly in the multi-agent reinforcement learning (MARL) setting, where the transition probabilities, observation kernel, and reward function are unknown. Here, we develop a general compression framework with approximate common and private state representations, from which decentralized policies can be constructed. We derive the optimality gap of executing dynamic programming (DP) with the approximate states in terms of the approximation error parameters and the remaining time steps. When the compression is exact (no error), the resulting DP is equivalent to the one in existing work. Our framework generalizes a number of methods proposed in the literature. The results shed light on designing practically useful deep-MARL network structures under the "centralized learning, distributed execution" scheme.
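For intuition, an optimality gap of this kind typically decomposes additively over the remaining time steps. The display below is an illustrative sketch of that shape, not the paper's exact statement: $\varepsilon_r$ and $\varepsilon_p$ denote assumed per-step reward and transition approximation errors induced by the compression, $L_V$ an assumed Lipschitz constant of the value function, and $(\hat z_t, \hat s_t)$ the approximate common and private states obtained by compressing the history $h_t$.
\[
\Bigl|\, V_t^{*}(h_t) \;-\; \widehat{V}_t\bigl(\hat z_t, \hat s_t\bigr) \Bigr|
\;\le\; \sum_{k=t}^{T}\bigl(\varepsilon_r + L_V\,\varepsilon_p\bigr)
\;=\; (T-t+1)\bigl(\varepsilon_r + L_V\,\varepsilon_p\bigr).
\]
Under this illustrative form, the gap vanishes when the compression is exact ($\varepsilon_r=\varepsilon_p=0$), consistent with the claim that the resulting DP then coincides with the exact one, and otherwise grows with the number of remaining time steps.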