Recent advances in reinforcement-learning research have demonstrated impressive results in building algorithms that can outperform humans on complex tasks. Nevertheless, creating reinforcement-learning systems that can build abstractions of their experience to accelerate learning in new contexts remains an active area of research. Previous work showed that reward-predictive state abstractions fulfill this goal, but they have only been applied to tabular settings. Here, we provide a clustering algorithm that enables the application of such state abstractions to deep-learning settings, providing compressed representations of an agent's inputs that preserve the ability to predict sequences of reward. A convergence theorem and simulations show that the resulting reward-predictive deep network maximally compresses the agent's inputs, significantly speeding up learning in high-dimensional visual control tasks. Furthermore, we present a set of generalization experiments and analyze under which conditions a pre-trained reward-predictive representation network can be re-used without re-training to accelerate learning -- a form of systematic out-of-distribution transfer.
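To make the underlying clustering criterion concrete, the sketch below illustrates reward-predictive state abstraction in a simplified tabular setting: states are grouped together when they predict the same expected reward sequence for every action sequence up to a fixed horizon. The inputs (transition tensor `P`, reward matrix `R`, `horizon`) and all function names are illustrative assumptions; this is not the deep-learning clustering algorithm presented in the paper.

```python
# Minimal tabular sketch of reward-predictive clustering (illustrative only).
# Two states are placed in the same cluster if, for every action sequence up
# to a fixed horizon, they predict the same expected reward at every step.
import itertools
import numpy as np


def reward_sequence_profile(P, R, horizon):
    """Collect, per state, the expected reward at each step of each action sequence.

    P: (num_actions, num_states, num_states) row-stochastic transition tensor.
    R: (num_states, num_actions) expected immediate rewards.
    Returns an array of shape (num_states, num_features) used as a clustering key.
    """
    num_actions, num_states, _ = P.shape
    profiles = []
    for seq in itertools.product(range(num_actions), repeat=horizon):
        dist = np.eye(num_states)              # row s: state distribution when starting in s
        for a in seq:
            profiles.append(dist @ R[:, a])    # expected reward of taking action a at this step
            dist = dist @ P[a]                 # propagate the state distribution
    return np.stack(profiles, axis=1)


def reward_predictive_clusters(P, R, horizon=3, tol=1e-8):
    """Group states whose reward-sequence profiles match up to a numerical tolerance."""
    profile = reward_sequence_profile(P, R, horizon)
    representatives = []                       # one representative state per cluster
    labels = -np.ones(profile.shape[0], dtype=int)
    for s in range(profile.shape[0]):
        for idx, rep in enumerate(representatives):
            if np.allclose(profile[s], profile[rep], atol=tol):
                labels[s] = idx
                break
        else:
            representatives.append(s)
            labels[s] = len(representatives) - 1
    return labels
```

The resulting cluster labels define a compressed state representation that still supports predicting reward sequences; the paper's contribution is, in essence, learning such a grouping of high-dimensional inputs with a deep network rather than by tabular enumeration.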