Supervised regression to demonstrations has proven to be a stable way to train deep policy networks. Motivated by this, we study how to take full advantage of supervised loss functions to stably train deep reinforcement learning agents. This is challenging because it is unclear how training data should be collected to enable policy improvement. In this work, we propose Self-Supervised Reinforcement Learning (SSRL), a simple algorithm that optimizes policies with purely supervised losses. We demonstrate that, without policy gradients or value estimation, an iterative procedure of ``labeling'' data and supervised regression is sufficient to drive stable policy improvement. By selecting and imitating trajectories with high episodic rewards, SSRL is surprisingly competitive with contemporary algorithms, achieving more stable performance with less running time and showing the potential of solving reinforcement learning with supervised learning techniques. The code is available at https://github.com/daochenzha/SSRL
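Below is a minimal sketch of the select-and-imitate loop described above, assuming a classic Gym control task (CartPole-v1), a small PyTorch policy, and an illustrative top-k trajectory selection rule; the environment, network size, and selection threshold are placeholder choices, not the paper's configuration, so treat this as an illustration of the idea rather than the authors' implementation.

```python
# Sketch of an SSRL-style loop: collect episodes, keep the highest-return
# trajectories ("labeling"), and fit the policy to them with a supervised
# behavior-cloning loss. Assumes the classic Gym reset/step API.
import numpy as np
import torch
import torch.nn as nn
import gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def rollout():
    """Run one episode with the current policy; return (states, actions, return)."""
    states, actions, total_reward = [], [], 0.0
    obs, done = env.reset(), False
    while not done:
        with torch.no_grad():
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        states.append(obs)
        actions.append(action)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return states, actions, total_reward

for iteration in range(50):
    # "Label" the data: collect episodes and keep the top trajectories by return.
    episodes = sorted((rollout() for _ in range(20)), key=lambda e: e[2], reverse=True)
    elite = episodes[:5]  # illustrative top-5 selection rule
    states = torch.as_tensor(np.array([s for e in elite for s in e[0]]), dtype=torch.float32)
    actions = torch.as_tensor([a for e in elite for a in e[1]])
    # Supervised regression onto the selected trajectories; no policy gradient
    # or value estimation is involved in the update.
    for _ in range(10):
        optimizer.zero_grad()
        loss_fn(policy(states), actions).backward()
        optimizer.step()
```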