Although deep reinforcement learning (RL) has surpassed human-level performance on various tasks, it still faces several fundamental challenges. First, most RL methods require extensive exploration data from the environment to achieve satisfactory performance. Second, the use of neural networks in RL makes it hard to interpret the internals of the system in a way that humans can understand. To address these two challenges, we propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations. Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm. We prove that in episodic RL, a finite reward automaton can express any non-Markovian bounded reward function with finitely many reward values, and can approximate any non-Markovian bounded reward function (with infinitely many reward values) to arbitrary precision. We also provide a lower bound on the episode length such that the proposed RL approach almost surely converges to an optimal policy in the limit. We test this approach on two RL environments with non-Markovian reward functions, choosing a variety of tasks with increasing complexity for each environment. We compare our algorithm with state-of-the-art RL algorithms for non-Markovian reward functions, such as Joint Inference of Reward Machines and Policies for RL (JIRP), Learning Reward Machines (LRM), and Proximal Policy Optimization (PPO2). Our results show that our algorithm converges to an optimal policy faster than the baseline methods.
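To make the central object concrete, the following is a minimal sketch of a finite reward automaton, modeled as a Mealy machine over high-level events that emits a reward on each transition. The specific task, the event names ("coffee", "office"), and the class interface are illustrative assumptions, not taken from the paper; they simply show how a reward that depends on the event history (and is therefore non-Markovian in the raw environment state) can be encoded with finitely many automaton states.

```python
# Sketch of a finite reward automaton as a Mealy machine: a finite set
# of states, a transition function over high-level events, and a reward
# function attached to transitions. Event names and the task are
# hypothetical, for illustration only.

class FiniteRewardAutomaton:
    def __init__(self, states, initial, delta, rho):
        self.states = states    # finite set of automaton states
        self.initial = initial  # initial automaton state
        self.delta = delta      # transitions: (state, event) -> next state
        self.rho = rho          # rewards:     (state, event) -> reward value
        self.state = initial

    def step(self, event):
        """Consume one high-level event and return the emitted reward."""
        reward = self.rho.get((self.state, event), 0.0)
        # missing transitions are treated as self-loops with zero reward
        self.state = self.delta.get((self.state, event), self.state)
        return reward

# Hypothetical task: "pick up coffee, then deliver it to the office".
# The reward depends on the order of past events, so it cannot be
# expressed as a Markovian function of the current environment state.
fra = FiniteRewardAutomaton(
    states={"u0", "u1", "u2"},
    initial="u0",
    delta={("u0", "coffee"): "u1", ("u1", "office"): "u2"},
    rho={("u1", "office"): 1.0},
)

rewards = [fra.step(e) for e in ["office", "coffee", "office"]]
# visiting the office before picking up the coffee yields no reward;
# only the final "office" event, reached after "coffee", is rewarded
```

Under this view, the L* algorithm's job is to infer the states, transitions, and reward outputs of such an automaton from observed episode traces.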