In practical applications, we can rarely assume full observability of a system's environment, although such knowledge is important for determining a reactive control system's precise interaction with its environment. We therefore propose an approach for reinforcement learning (RL) in partially observable environments. We assume that the environment behaves like a partially observable Markov decision process with a known set of discrete actions, but we assume no knowledge of its structure or transition probabilities. Our approach combines Q-learning with IoAlergia, a method for learning Markov decision processes (MDPs). By learning MDP models of the environment from the RL agent's episodes, we enable RL in partially observable domains without explicit, additional memory of previous interactions to resolve the ambiguities that stem from partial observability. Instead, we provide RL with additional observations in the form of abstract environment states, obtained by simulating new experiences on the learned environment models to track the explored states. In our evaluation, we report on the validity of our approach and its promising performance compared to six state-of-the-art deep RL techniques with recurrent neural networks and fixed memory.
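To illustrate the core idea, the following is a minimal sketch, not the authors' implementation: tabular Q-learning over an augmented state that pairs the current observation with an abstract state tracked by replaying the interaction on a learned MDP model. The `LearnedMdp` class, the environment interface (`reset`/`step`), the fallback "chaos" state for traces that leave the model, and all hyperparameters are hypothetical placeholders standing in for IoAlergia-learned models and the paper's actual setup.

```python
import random
from collections import defaultdict

class LearnedMdp:
    """Placeholder for an environment model learned from episodes (e.g. via IoAlergia)."""
    def __init__(self, initial_state, transitions):
        # transitions: dict mapping (state, action, observation) -> next abstract state
        self.initial_state = initial_state
        self.transitions = transitions

    def reset(self):
        return self.initial_state

    def step(self, state, action, observation):
        # Track the abstract state by simulating the agent's trace on the model;
        # fall back to a designated "chaos" state if the trace leaves the model.
        return self.transitions.get((state, action, observation), "chaos")

def q_learning_with_model(env, model, actions, episodes=1000,
                          alpha=0.1, gamma=0.95, epsilon=0.1):
    # Q-values indexed by ((abstract_state, observation), action)
    q = defaultdict(float)

    for _ in range(episodes):
        obs = env.reset()
        abstract = model.reset()
        done = False
        while not done:
            state = (abstract, obs)
            # epsilon-greedy action selection on the augmented state
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])

            next_obs, reward, done = env.step(action)
            next_abstract = model.step(abstract, action, next_obs)
            next_state = (next_abstract, next_obs)

            # standard Q-learning update on the augmented state space
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

            obs, abstract = next_obs, next_abstract
    return q
```

The augmented state acts as the additional observation mentioned above: instead of equipping the agent with explicit memory, the learned model's abstract state disambiguates observations that would otherwise look identical under partial observability.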