In value-based reinforcement learning (RL), unlike in supervised learning, the agent faces not a single, stationary approximation problem, but a sequence of value prediction problems. Each time the policy improves, the nature of the problem changes, shifting both the distribution of states and their values. In this paper we take a novel perspective, arguing that the value prediction problems faced by an RL agent should not be addressed in isolation, but rather as a single, holistic prediction problem. An RL algorithm generates a sequence of policies that, at least approximately, improve towards the optimal policy. We explicitly characterize the associated sequence of value functions and call it the value-improvement path. Our main idea is to approximate the value-improvement path holistically, rather than to solely track the value function of the current policy. Specifically, we discuss the impact that this holistic view of RL has on representation learning. We demonstrate that a representation that spans the past value-improvement path will also provide an accurate value approximation for future policy improvements. We use this insight to better understand existing approaches to auxiliary tasks and to propose new ones. To test our hypothesis empirically, we augmented a standard deep RL agent with an auxiliary task of learning the value-improvement path. In a study of Atari 2600 games, the augmented agent achieved approximately double the mean and median performance of the baseline agent.
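To make the auxiliary-task idea concrete, the sketch below is our own illustration (not the paper's implementation) of one way an agent could learn the value-improvement path: a shared representation trunk feeds a main TD head for the current policy plus auxiliary heads that regress toward stored value estimates of earlier policies along the path. The names `PathAuxQNetwork`, `num_past_heads`, `past_targets`, and `aux_weight` are hypothetical.

```python
# Minimal sketch, assuming a PyTorch-style Q-network with auxiliary heads
# that predict the values of past policies along the value-improvement path.
import torch
import torch.nn as nn


class PathAuxQNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, num_past_heads: int = 4):
        super().__init__()
        # Shared representation trunk, encouraged to span the value-improvement path.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Main head: action values of the current policy.
        self.q_head = nn.Linear(256, num_actions)
        # Auxiliary heads: action values of earlier policies along the path.
        self.past_heads = nn.ModuleList(
            [nn.Linear(256, num_actions) for _ in range(num_past_heads)]
        )

    def forward(self, obs: torch.Tensor):
        phi = self.trunk(obs)
        return self.q_head(phi), [head(phi) for head in self.past_heads]


def loss_fn(net, obs, td_targets, past_targets, aux_weight=0.5):
    """TD loss for the current value function plus an auxiliary regression
    toward stored value estimates of earlier policies (the past path)."""
    q, past_preds = net(obs)
    main_loss = nn.functional.smooth_l1_loss(q, td_targets)
    aux_loss = sum(
        nn.functional.mse_loss(pred, tgt)
        for pred, tgt in zip(past_preds, past_targets)
    ) / len(past_preds)
    return main_loss + aux_weight * aux_loss


if __name__ == "__main__":
    net = PathAuxQNetwork(obs_dim=8, num_actions=4)
    obs = torch.randn(32, 8)
    td_targets = torch.randn(32, 4)                        # stand-in TD targets
    past_targets = [torch.randn(32, 4) for _ in range(4)]  # stand-in past-policy values
    print(loss_fn(net, obs, td_targets, past_targets).item())
```

Because the auxiliary heads share the trunk, the learned representation must explain both current and past value functions, which is the sense in which it "spans" the value-improvement path.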