Training deep reinforcement learning agents on environments with multiple levels / scenes from the same task has become essential for many applications aiming at generalization and domain transfer from simulation to the real world. While this strategy helps with generalization, the use of multiple scenes significantly increases the variance of the samples collected for policy gradient computations. Current methods effectively continue to view this collection of scenes as a single Markov decision process (MDP) and thus learn a scene-generic value function V(s). However, we argue that the sample variance for a multi-scene environment is best minimized by treating each scene as a distinct MDP and then learning a joint value function V(s, M) that depends on both the state s and the MDP M. We further demonstrate that the true joint value function for a multi-scene environment follows a multi-modal distribution which is not captured by traditional CNN / LSTM based critic networks. To this end, we propose a dynamic value estimation (DVE) technique, which approximates the true joint value function through a sparse attention mechanism over multiple value function hypotheses / modes. The resulting agent not only shows significant improvements in the final reward score across a range of OpenAI ProcGen environments, but also exhibits enhanced navigation efficiency and provides an implicit mechanism for unsupervised state-space skill decomposition.
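To make the idea of attending over multiple value hypotheses concrete, here is a minimal sketch of such a critic head in PyTorch. The module names, the number of modes, and the top-k masking used to sparsify the attention weights are illustrative assumptions, not the paper's actual architecture or its specific sparse attention mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicValueHead(nn.Module):
    """Illustrative multi-hypothesis value head: K candidate value modes
    are combined via a sparsified attention distribution over the modes."""

    def __init__(self, feat_dim: int, num_modes: int = 8, top_k: int = 2):
        super().__init__()
        self.value_modes = nn.Linear(feat_dim, num_modes)  # one scalar value per hypothesis
        self.attn_logits = nn.Linear(feat_dim, num_modes)  # scores for attending over hypotheses
        self.top_k = top_k

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        values = self.value_modes(features)    # (B, K) candidate value estimates
        logits = self.attn_logits(features)    # (B, K) mode scores
        # Crude sparsification (assumption): keep only the top-k modes, mask out the rest.
        topk = torch.topk(logits, self.top_k, dim=-1)
        masked = torch.full_like(logits, float('-inf'))
        masked.scatter_(-1, topk.indices, topk.values)
        weights = F.softmax(masked, dim=-1)    # sparse attention weights over modes
        return (weights * values).sum(dim=-1)  # (B,) estimated V(s, M)
```

In this sketch, `features` would come from the shared CNN / LSTM encoder, and the weighted sum would replace the usual single scalar critic output when computing advantages for the policy gradient.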