Data-Efficient Reinforcement Learning with Self-Predictive Representations

While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future. We compute target representations for future states using an encoder whose parameters are an exponential moving average of the agent's parameters, and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents a 55% relative improvement over the previous state-of-the-art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. The code associated with this work is available at https://github.com/mila-iqia/spr
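The future prediction objective described above can be illustrated with a minimal, forward-only NumPy sketch. It is not the released implementation: the linear encoder, the `tanh` transition model, the negative cosine similarity loss, the decay rate `tau`, and all function names are assumptions chosen for illustration. Data augmentation and the reward-maximization loss are omitted, and in actual training no gradient would flow through the target encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(in_dim, out_dim):
    # Hypothetical toy parameters; a real agent would use a conv encoder.
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

def encode(enc_params, obs):
    # Map observations (B, obs_dim) to latent states (B, latent_dim).
    return np.tanh(obs @ enc_params)

def transition(trans_params, z, action_onehot):
    # Predict the next latent state from the current latent and the action.
    return np.tanh(np.concatenate([z, action_onehot], axis=-1) @ trans_params)

def ema_update(target_params, online_params, tau=0.99):
    # Target encoder tracks an exponential moving average of the online one.
    return tau * target_params + (1.0 - tau) * online_params

def cosine_loss(pred, target, eps=1e-8):
    # Negative cosine similarity per batch element (assumed loss choice).
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return -np.sum(pred * target, axis=-1)

def future_prediction_loss(online_enc, target_enc, trans, obs_seq, act_seq):
    # obs_seq: (K+1, B, obs_dim); act_seq: (K, B, n_actions), one-hot.
    z = encode(online_enc, obs_seq[0])           # start from the online latent
    total = np.zeros(obs_seq.shape[1])
    for k in range(act_seq.shape[0]):
        z = transition(trans, z, act_seq[k])     # roll forward purely in latent space
        target = encode(target_enc, obs_seq[k + 1])  # EMA encoder gives the target
        total += cosine_loss(z, target)
    return total.mean()

# Toy usage with random data.
obs_dim, latent_dim, n_actions, K, B = 8, 4, 3, 5, 2
online_enc = init_linear(obs_dim, latent_dim)
target_enc = online_enc.copy()
trans = init_linear(latent_dim + n_actions, latent_dim)
obs_seq = rng.normal(size=(K + 1, B, obs_dim))
act_seq = np.eye(n_actions)[rng.integers(0, n_actions, size=(K, B))]

loss = future_prediction_loss(online_enc, target_enc, trans, obs_seq, act_seq)
target_enc = ema_update(target_enc, online_enc)
```

Because each step contributes a cosine term in [-1, 1], the summed loss over K steps lies in [-K, K]; lower values mean the latent rollouts align better with the target encoder's representations of the true future observations.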
