We propose world value functions (WVFs), a type of goal-oriented general value function that represents how to solve not just a given task, but any other goal-reaching task in an agent's environment. This is achieved by equipping an agent with an internal goal space defined as all the world states where it experiences a terminal transition. The agent can then modify the standard task rewards to define its own reward function, which provably drives it to learn how to achieve all reachable internal goals, and the value of doing so in the current task. We demonstrate two key benefits of WVFs in the context of learning and planning. In particular, given a learned WVF, an agent can compute the optimal policy in a new task by simply estimating the task's reward function. Furthermore, we show that WVFs also implicitly encode the transition dynamics of the environment, and so can be used to perform planning. Experimental results show that WVFs can be learned faster than regular value functions, while their ability to infer the environment's dynamics can be used to integrate learning and planning methods to further improve sample efficiency.
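To make the construction described above concrete, below is a minimal tabular sketch of goal-conditioned Q-learning with a WVF-style reward modification. It is illustrative only: the environment interface (env.reset, env.step, env.actions), the goal_space argument, and the constants R_MIN, ALPHA, GAMMA, EPS are hypothetical placeholders and assumed values, not an API or hyperparameters taken from the paper.

```python
# Minimal sketch: learning a goal-conditioned value table with a WVF-style
# modified reward. All environment names and constants are assumptions.
import random
from collections import defaultdict

R_MIN = -10.0            # assumed penalty for terminating away from the commanded goal
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

# Q[(state, goal)][action]: value of pursuing `goal` under the current task's rewards
Q = defaultdict(lambda: defaultdict(float))

def wvf_reward(next_state, goal, reward, done):
    """Keep the task reward, unless the episode ends in a terminal state that is not the commanded goal."""
    if done and next_state != goal:
        return R_MIN
    return reward

def greedy_action(state, goal, actions):
    return max(actions, key=lambda a: Q[(state, goal)][a])

def learn_episode(env, goal_space):
    goal = random.choice(list(goal_space))   # the agent commands itself an internal goal
    state, done = env.reset(), False
    while not done:
        if random.random() < EPS:
            action = random.choice(env.actions)
        else:
            action = greedy_action(state, goal, env.actions)
        next_state, reward, done = env.step(action)
        r_bar = wvf_reward(next_state, goal, reward, done)
        target = r_bar if done else r_bar + GAMMA * max(Q[(next_state, goal)].values() or [0.0])
        Q[(state, goal)][action] += ALPHA * (target - Q[(state, goal)][action])
        state = next_state

def task_policy(state, goal_space, actions):
    """Greedy policy for the current task, recovered by maximising over internal goals and actions."""
    return max(((g, a) for g in goal_space for a in actions),
               key=lambda ga: Q[(state, ga[0])][ga[1]])[1]
```

Maximising jointly over goals and actions recovers a greedy policy for the task whose rewards were used during learning; transferring to a new task would additionally require estimating that task's reward function, as the abstract describes.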