We develop Upside-Down Reinforcement Learning (UDRL), a method for learning to act using only supervised learning techniques. Unlike traditional algorithms, UDRL does not use reward prediction or search for an optimal policy. Instead, it trains agents to follow commands such as "obtain so much total reward in so much time." Many of its general principles are outlined in a companion report; the goal of this paper is to develop a practical learning algorithm and show that this conceptually simple perspective on agent training can produce a range of rewarding behaviors for multiple episodic environments. Experiments show that on some tasks UDRL's performance can be surprisingly competitive with, and even exceed, that of some traditional baseline algorithms developed over decades of research. Based on these results, we suggest that alternative approaches to expected reward maximization have an important role to play in training useful autonomous agents.
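The command-following idea can be sketched in a few lines: logged episodes are relabeled so that each state is paired with the command that was actually achieved from that point onward, yielding ordinary supervised training examples for a command-conditioned policy. This is an illustrative simplification under assumed names (`udrl_training_pairs` is hypothetical), not the paper's implementation.

```python
def udrl_training_pairs(states, actions, rewards):
    """For each timestep t, build the command that the agent actually
    fulfilled from t onward: (total reward collected, steps it took).
    The supervised target is the action the agent took at t."""
    pairs = []
    T = len(states)
    for t in range(T):
        desired_return = sum(rewards[t:])  # reward obtained from t to episode end
        desired_horizon = T - t            # remaining steps in the episode
        command = (desired_return, desired_horizon)
        pairs.append(((states[t], command), actions[t]))
    return pairs

# Example: a 3-step episode
states = ["s0", "s1", "s2"]
actions = ["left", "right", "left"]
rewards = [1.0, 0.0, 2.0]
pairs = udrl_training_pairs(states, actions, rewards)
# first example: (("s0", (3.0, 3)), "left")
```

A classifier trained on such pairs can then be queried at test time with a *desired* return and horizon, turning reward maximization into input selection rather than an objective.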