The decentralized partially observable Markov decision process (DEC-POMDP) models sequential decision-making problems for a team of agents. Since planning in a DEC-POMDP can be interpreted as maximum likelihood estimation in a latent variable model, a DEC-POMDP can be solved by the EM algorithm. However, in EM for DEC-POMDPs, the forward--backward algorithm must be evaluated up to the infinite horizon, which impairs computational efficiency. In this paper, we propose the Bellman EM algorithm (BEM) and the modified Bellman EM algorithm (MBEM), which introduce the forward and backward Bellman equations into EM. BEM can be more efficient than EM because it solves the forward and backward Bellman equations instead of running the forward--backward algorithm up to the infinite horizon. However, BEM is not always more efficient than EM for large problems because it requires computing an inverse matrix. MBEM circumvents this shortcoming by solving the forward and backward Bellman equations without the inverse matrix. Our numerical experiments demonstrate that MBEM converges faster than EM.
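To make the contrast concrete, a minimal sketch under a generic discounted planning-as-inference formulation (the symbols $p_0$, $P$, $R$, and $\gamma < 1$ denote an initial distribution, a policy-induced transition matrix, an expected-reward vector, and a discount factor; they are illustrative assumptions, not the paper's exact notation): the discounted sums of the forward and backward messages satisfy linear Bellman-type fixed-point equations,
\[
  \alpha^{\top} = p_0^{\top} + \gamma\,\alpha^{\top} P
  \quad\Longleftrightarrow\quad
  \alpha^{\top} = p_0^{\top}(I - \gamma P)^{-1},
\]
\[
  \beta = R + \gamma\,P\beta
  \quad\Longleftrightarrow\quad
  \beta = (I - \gamma P)^{-1} R.
\]
Solving these linear systems in closed form, as the abstract describes for BEM, requires the inverse $(I - \gamma P)^{-1}$, whereas iterating the left-hand fixed-point updates, the route attributed to MBEM, avoids the inversion.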