While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we view reinforcement learning as inferring policies that achieve desired outcomes, rather than as a problem of maximizing rewards. To solve this inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to hand-craft reward functions for a suite of diverse manipulation and locomotion tasks and leads to effective goal-directed behaviors.
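To make the inference view concrete, the following is a minimal sketch of the standard control-as-inference evidence lower bound, assuming binary optimality variables $\mathcal{O}_t$ with a hand-specified likelihood $p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp r(s_t, a_t)$; the formulation proposed here instead learns the outcome likelihood directly from environment interactions, so this bound is illustrative of the general framing rather than the exact objective derived in the paper:

\[
\log p(\mathcal{O}_{0:T} = 1) \;\geq\; \mathbb{E}_{q(\tau)}\Big[\sum_{t=0}^{T} r(s_t, a_t)\Big] \;-\; \mathrm{KL}\big(q(\tau) \,\|\, p(\tau)\big),
\]

where $q(\tau)$ is the trajectory distribution induced by the variational (learned) policy and $p(\tau)$ is the trajectory distribution under a prior policy and the environment dynamics. Maximizing the right-hand side over the policy recovers maximum-entropy-style reinforcement learning; replacing the hand-specified $\exp r$ likelihood with a learned probability of achieving the desired outcome plays the role of the well-shaped, automatically derived reward described above.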