While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task but also provide sufficient shaping to accomplish it. In this paper, we discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes rather than maximizing rewards. To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator, reminiscent of the standard Bellman backup, and use it to develop an off-policy algorithm for goal-directed tasks. We empirically demonstrate that this method eliminates the need to manually design reward functions and leads to effective goal-directed behaviors.
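To make the outcome-directed inference view concrete, the following is a minimal illustrative sketch in the control-as-inference style, not the exact derivation used here: assume a desired outcome (goal) $g$ modeled by an outcome variable $G$ attached to the terminal state, a goal-conditioned policy $\pi(a_t \mid s_t, g)$ inducing a trajectory distribution $q_\pi(\tau)$, and an action prior $p(a_t \mid s_t)$. Applying Jensen's inequality (the dynamics terms cancel between $p$ and $q_\pi$) gives the evidence lower bound
\[
\log p(G = g \mid s_0)
\;\geq\;
\mathbb{E}_{q_\pi(\tau)}\!\left[
\log p(G = g \mid s_T)
\;-\;
\sum_{t=0}^{T-1}
\mathrm{KL}\big(\pi(\cdot \mid s_t, g)\,\|\,p(\cdot \mid s_t)\big)
\right].
\]
Maximizing this bound over $\pi$ replaces a hand-designed reward with the learnable log-likelihood term $\log p(G = g \mid s_T)$, regularized by the per-step KL divergence; the formulation in the paper generalizes this idea to outcomes that may be achieved at any time step, which is what gives rise to the probabilistic Bellman backup operator.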