Representation learning and exploration are among the key challenges for any deep reinforcement learning agent. In this work, we provide a method based on the singular value decomposition (SVD) that can be used to obtain representations that preserve the underlying transition structure of the domain. Interestingly, we show that these representations also capture the relative frequency of state visitations, thereby providing an estimate of pseudo-counts for free. To scale this decomposition method to large-scale domains, we provide an algorithm that never requires building the transition matrix, can make use of deep networks, and permits mini-batch training. Further, we draw inspiration from predictive state representations and extend our decomposition method to partially observable environments. With experiments in multi-task, partially observable settings, we show that the proposed method can not only learn useful representations on DM-Lab-30 environments (whose inputs include language instructions, pixel images, and rewards, among others), but can also be effective on hard-exploration tasks in DM-Hard-8 environments.
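As a toy illustration of the tabular case only (not the paper's scalable, matrix-free algorithm), one can form the transition matrix of a small Markov chain explicitly and take its SVD; the rows of the left singular-vector matrix then serve as state representations, and truncating to the top singular directions gives a low-rank embedding. The chain below is a hypothetical example chosen for simplicity:

```python
import numpy as np

# Toy 4-state ring Markov chain (hypothetical example, not from the paper):
# from each state, step left or right with probability 0.5.
n = 4
P = np.zeros((n, n))
for s in range(n):
    P[s, (s + 1) % n] = 0.5  # step right
    P[s, (s - 1) % n] = 0.5  # step left

# SVD of the transition matrix: P = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(P)

# Each row of U is a representation of the corresponding state;
# in the tabular case these preserve the transition structure exactly.
reps = U  # shape (n, n); keep only the first k columns for an embedding

# Low-rank reconstruction using the top-k singular directions.
k = 2
P_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
```

The scalable method in the paper avoids ever materializing `P`, replacing the explicit SVD with a deep-network parameterization trained on mini-batches; this sketch only shows what the decomposition computes in the small tabular setting.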