As one of the solutions to decentralized partially observable Markov decision process (Dec-POMDP) problems, value decomposition methods have achieved significant results recently. However, most value decomposition methods require the fully observable global state of the environment during training, which is infeasible in scenarios where only incomplete and noisy observations are available. We therefore propose a novel value decomposition framework, State Inference for value DEcomposition (SIDE), which eliminates the need for the global state by jointly solving the two problems of optimal control and state inference. SIDE can be extended to any value decomposition method to tackle partially observable problems. Comparing against the performance of different algorithms on StarCraft II micromanagement tasks, we verify that, even without access to the true state, SIDE can infer a current state representation from past local observations that benefits the reinforcement learning process, and it even achieves results superior to many baselines in some complex scenarios.
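Since the abstract describes the framework only at a high level, the following minimal PyTorch sketch shows one plausible way to realize the general idea: per-agent recurrent Q-networks over local observations, a variational module that infers a latent stand-in for the unavailable global state from the agents' hidden histories, and a QMIX-style monotonic mixer conditioned on that inferred latent. All class names, the variational encoder, and the mixing structure here are illustrative assumptions, not the paper's actual SIDE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNetwork(nn.Module):
    """Per-agent recurrent Q-network over local observation histories."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h):
        x = F.relu(self.fc(obs))
        h = self.rnn(x, h)
        return self.q_head(h), h


class LatentStateInference(nn.Module):
    """Hypothetical module that infers a latent 'state' from the agents'
    concatenated recurrent hidden states (a stand-in for the true global
    state, which is assumed unavailable during training)."""
    def __init__(self, n_agents, hidden_dim=64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_agents * hidden_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)

    def forward(self, agent_hiddens):              # (batch, n_agents * hidden_dim)
        e = self.encoder(agent_hiddens)
        mu, log_var = self.mu(e), self.log_var(e)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick
        return z, mu, log_var


class MonotonicMixer(nn.Module):
    """QMIX-style monotonic mixing network whose hypernetworks are
    conditioned on the inferred latent state instead of the global state."""
    def __init__(self, n_agents, latent_dim=32, embed_dim=32):
        super().__init__()
        self.embed_dim = embed_dim
        self.w1 = nn.Linear(latent_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(latent_dim, embed_dim)
        self.w2 = nn.Linear(latent_dim, embed_dim)
        self.b2 = nn.Linear(latent_dim, 1)

    def forward(self, agent_qs, z):                # (batch, n_agents), (batch, latent_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.w1(z)).view(bs, -1, self.embed_dim)  # |w| keeps Q_tot monotonic in agent Qs
        b1 = self.b1(z).view(bs, 1, self.embed_dim)
        hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)
        w2 = torch.abs(self.w2(z)).view(bs, self.embed_dim, 1)
        b2 = self.b2(z).view(bs, 1, 1)
        return (hidden @ w2 + b2).view(bs, 1)      # Q_tot, shape (batch, 1)


# Example forward pass with 3 agents, observation dim 10, 5 actions (toy values).
n_agents, obs_dim, n_actions, batch, hidden_dim = 3, 10, 5, 4, 64
agents = AgentQNetwork(obs_dim, n_actions, hidden_dim)
infer = LatentStateInference(n_agents, hidden_dim)
mixer = MonotonicMixer(n_agents)
h = torch.zeros(batch * n_agents, hidden_dim)
obs = torch.rand(batch * n_agents, obs_dim)
q, h = agents(obs, h)
chosen_q = q.max(dim=1).values.view(batch, n_agents)
z, mu, log_var = infer(h.view(batch, n_agents * hidden_dim))
q_tot = mixer(chosen_q, z)                          # joint value from inferred latent state
```

In such a sketch, the mixer never sees the environment's true global state; the latent z, trained alongside the TD objective (and, under the variational assumption, a reconstruction or KL term), plays its role during centralized training.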