This paper proposes DeepSynth, a method for effective training of deep Reinforcement Learning (RL) agents when the reward is sparse and non-Markovian, but at the same time progress towards the reward requires achieving an unknown sequence of high-level objectives. Our method employs a novel algorithm for synthesis of compact automata to uncover this sequential structure automatically. We synthesise a human-interpretable automaton from trace data collected by exploring the environment. The state space of the environment is then enriched with the synthesised automaton so that the generation of a control policy by deep RL is guided by the discovered structure encoded in the automaton. The proposed approach is able to cope with both high-dimensional, low-level features and unknown sparse non-Markovian rewards. We have evaluated DeepSynth's performance in a set of experiments that includes the Atari game Montezuma's Revenge. Compared to existing approaches, we obtain a reduction of two orders of magnitude in the number of iterations required for policy synthesis, and also a significant improvement in scalability.
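To make the product construction described above concrete, the following is a minimal sketch, not the authors' implementation: a wrapper that enriches a Gym-style environment's observations with the current state of a synthesised automaton, so that a deep RL agent learns over the product space. The transition table `delta`, the labelling function `label`, and integer-indexed DFA states are illustrative assumptions.

```python
# Minimal sketch (assumed interface, not the DeepSynth code): enrich the
# environment state with the state of a synthesised automaton (DFA).
import numpy as np

class ProductEnv:
    def __init__(self, env, delta, n_dfa_states, q0, accepting, label):
        self.env = env
        self.delta = delta          # dict: (dfa_state, event) -> dfa_state
        self.n = n_dfa_states       # DFA states assumed indexed 0..n-1
        self.q0 = q0
        self.accepting = accepting  # set of accepting DFA states
        self.label = label          # maps a raw observation to a high-level event
        self.q = q0

    def reset(self):
        self.q = self.q0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, env_reward, done, info = self.env.step(action)
        event = self.label(obs)
        self.q = self.delta.get((self.q, event), self.q)
        # Reward is shaped by progress in the automaton: reaching an accepting
        # DFA state signals completion of the discovered high-level sequence,
        # making the sparse non-Markovian objective Markovian over the product.
        reward = env_reward + (1.0 if self.q in self.accepting else 0.0)
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        # One-hot encode the automaton state and append it to the observation.
        one_hot = np.zeros(self.n, dtype=np.float32)
        one_hot[self.q] = 1.0
        return np.concatenate([np.asarray(obs, dtype=np.float32).ravel(), one_hot])
```

Any off-the-shelf deep RL algorithm (e.g. DQN) can then be trained on `ProductEnv` unchanged, since the automaton state is simply part of the observation.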