Many real-world reinforcement learning (RL) problems require learning complex, temporally extended behavior that may only receive a reward signal once the behavior is completed. If the reward-worthy behavior is known, it can be specified in terms of a non-Markovian reward function: a function that depends on aspects of the state-action history rather than just the current state and action. Such reward functions yield sparse rewards, requiring an inordinate number of experiences to find a policy that captures the reward-worthy pattern of behavior. Recent work has leveraged Knowledge Representation (KR) to provide a symbolic abstraction of aspects of the state that summarizes reward-relevant properties of the state-action history and supports learning a Markovian decomposition of the problem in terms of an automaton over the KR. Providing such a decomposition has been shown to vastly improve learning rates, especially when coupled with algorithms that exploit automaton structure. Nevertheless, such techniques rely on a priori knowledge of the KR. In this work, we explore how to automatically discover useful state abstractions that support learning automata over the state-action history. The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL on simple non-Markovian domains.
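To make the contrast concrete, the following is a minimal, hedged sketch (not the paper's algorithm) of a toy non-Markovian reward that depends on the whole history of symbolic labels, alongside a Markovian decomposition of the same task as a small automaton. The labels 'a' and 'b', the automaton states, and the "observe a, then later b" task are all hypothetical placeholders chosen for illustration.

```python
# Toy non-Markovian reward: it must scan the entire state-action history
# (here abstracted as a sequence of symbolic labels) to decide the reward.
def non_markovian_reward(history):
    """Return 1.0 only once 'a' has appeared and is later followed by 'b'."""
    seen_a = False
    for label in history:
        if label == 'a':
            seen_a = True
        elif label == 'b' and seen_a:
            return 1.0
    return 0.0

# Markovian decomposition: an automaton state u summarizes the reward-relevant
# part of the history, so the reward depends only on (u, current label).
AUTOMATON = {                     # (u, label) -> (next u, reward)
    ('u0', 'a'): ('u1', 0.0),
    ('u1', 'b'): ('u2', 1.0),     # task completed
}

def step(u, label):
    # Unlisted transitions stay in the same automaton state with zero reward.
    return AUTOMATON.get((u, label), (u, 0.0))

# Example rollout: the automaton reproduces the history-based reward.
u, total = 'u0', 0.0
for label in ['c', 'a', 'c', 'b']:
    u, r = step(u, label)
    total += r
print(total)  # 1.0, matching non_markovian_reward(['c', 'a', 'c', 'b'])
```

Under this (assumed) decomposition, the pair of environment state and automaton state is Markovian, which is what lets automaton-aware RL algorithms exploit the structure; the contribution described in the abstract is discovering the symbolic abstraction that makes such an automaton learnable, rather than assuming it is given.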