We introduce a new unsupervised pre-training method for reinforcement learning called APT, which stands for Active Pre-Training. APT learns behaviors and representations by actively searching for novel states in reward-free environments. The key novel idea is to explore the environment by maximizing a non-parametric entropy computed in an abstract representation space, which avoids challenging density modeling and consequently allows our approach to scale much better in environments with high-dimensional observations (e.g., image observations). We empirically evaluate APT by exposing task-specific rewards after a long unsupervised pre-training phase. On Atari games, APT achieves human-level performance on 12 games and obtains highly competitive performance compared to canonical fully supervised RL algorithms. On the DMControl suite, APT beats all baselines in terms of asymptotic performance and data efficiency and dramatically improves performance on tasks that are extremely difficult to train from scratch.
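To make the core idea concrete, below is a minimal sketch of a non-parametric, particle-based entropy estimate used as an intrinsic reward: the distance from each encoded state to its k-th nearest neighbor in representation space serves as a novelty signal. This is an illustrative assumption of how such a reward could be computed, not the authors' exact implementation; the function name `knn_entropy_reward`, the choice of `k`, and the reward shaping via `log(1 + distance)` are assumptions for the sketch.

```python
import numpy as np

def knn_entropy_reward(reps: np.ndarray, k: int = 10) -> np.ndarray:
    """Per-state intrinsic reward from k-NN distances in representation space.

    reps: (N, d) array of encoded observations (e.g., from a learned encoder).
    Returns an (N,) array; larger values indicate states farther from their
    neighbors, i.e., more novel under the particle-based entropy estimate.
    """
    # Pairwise Euclidean distances between all representations in the batch.
    sq_norms = (reps ** 2).sum(axis=1)
    dists = np.sqrt(np.maximum(
        sq_norms[:, None] + sq_norms[None, :] - 2.0 * reps @ reps.T, 0.0))
    # Distance to the k-th nearest neighbor (index 0 is the point itself).
    knn_dist = np.sort(dists, axis=1)[:, k]
    # log(1 + distance) keeps the reward non-negative and monotone in novelty.
    return np.log(1.0 + knn_dist)

# Usage sketch: encode a batch of observations with any representation
# learner, then feed the intrinsic reward to a standard RL algorithm.
# reps = encoder(obs_batch); r_intrinsic = knn_entropy_reward(reps)
```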