In computational reinforcement learning, a growing body of work seeks to construct an agent's perception of the world through predictions of future sensations; predictions about environment observations are used as additional input features to enable better goal-directed decision-making. An open challenge in this line of work is determining which of the infinitely many predictions the agent could possibly make would best support decision-making. This challenge is especially apparent in continual learning problems, where a single stream of experience is available to a single agent. As a primary contribution, we introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward -- all during a single ongoing process of continual learning. In this manuscript we consider predictions expressed as General Value Functions: temporally extended estimates of the accumulation of a future signal. We demonstrate that, through interaction with the environment, an agent can independently select predictions that resolve partial observability, resulting in performance similar to that of expertly specified GVFs. By learning, rather than manually specifying, these predictions, we enable the agent to identify useful predictions in a self-supervised manner, taking a step towards truly autonomous systems.
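For concreteness, the General Value Functions referred to above are typically written as the expected discounted accumulation of a cumulant signal. The notation below (cumulant $C$, state-dependent continuation function $\gamma$, and target policy $\pi$) is a minimal sketch following the standard GVF literature, not notation introduced by this manuscript:

\[
V^{\pi}(s) \doteq \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\left(\prod_{j=1}^{k}\gamma(S_{t+j})\right) C_{t+k+1} \;\middle|\; S_t = s\right].
\]

The agent's learned estimate of such a prediction "question" can then be supplied as an additional input feature for control, as described in the abstract.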