A key challenge in the field of reinforcement learning is to develop agents that behave cautiously in novel situations. It is generally impossible to anticipate all situations that an autonomous system may face or what behavior would best avoid bad outcomes. An agent that could learn to be cautious would overcome this challenge by discovering for itself when and how to behave cautiously. In contrast, current approaches typically embed task-specific safety information or explicit cautious behaviors into the system, which is error-prone and imposes extra burdens on practitioners. In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious and an algorithm to demonstrate that it is possible for a system to \emph{learn} to be cautious. The essential features of our algorithm are that it characterizes reward function uncertainty without task-specific safety information and uses this uncertainty to construct a robust policy. Specifically, we construct robust policies with a $k$-of-$N$ counterfactual regret minimization (CFR) subroutine given a learned reward function uncertainty represented by a neural network ensemble belief. These policies exhibit caution in each of our tasks without any task-specific safety tuning.
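To make the $k$-of-$N$ robustness idea concrete, the following is a minimal, illustrative sketch: sample $N$ reward functions from a belief over rewards, identify the $k$ samples under which the current policy fares worst, and update the policy with a regret-minimizing step against their average. This is not the paper's implementation: the neural network ensemble belief is abstracted as a simple sampler over tabular reward vectors, the CFR subroutine is reduced to regret matching at a single decision point, and all names and constants are hypothetical.
\begin{verbatim}
# Sketch of a k-of-N robust policy update (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 4    # actions at a single decision point
N_SAMPLES = 20   # N: reward functions sampled from the belief each iteration
K_WORST = 5      # k: worst-case samples the policy is made robust to

def sample_reward_functions(n):
    """Stand-in for drawing n reward functions from an ensemble belief."""
    base = np.array([1.0, 0.5, 0.0, -0.5])
    return base + rng.normal(scale=0.5, size=(n, N_ACTIONS))

def regret_matching_policy(cum_regret):
    """Convert cumulative regrets into a policy via regret matching."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(N_ACTIONS, 1.0 / N_ACTIONS)

cum_regret = np.zeros(N_ACTIONS)
for _ in range(1000):
    policy = regret_matching_policy(cum_regret)
    rewards = sample_reward_functions(N_SAMPLES)   # N candidate reward functions
    values = rewards @ policy                      # policy value under each sample
    worst = rewards[np.argsort(values)[:K_WORST]]  # the k worst samples for this policy
    avg_worst = worst.mean(axis=0)                 # adversary mixes over the k worst
    cum_regret += avg_worst - avg_worst @ policy   # accumulate regret against that mixture

print("robust policy:", np.round(regret_matching_policy(cum_regret), 3))
\end{verbatim}
Because the adversary repeatedly selects the $k$ least favorable reward samples, the resulting policy hedges against reward-function uncertainty rather than optimizing a single point estimate; this is the sense in which the learned policy behaves cautiously.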