We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from observations or an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning. The unsupervised FB loss is well-principled: if training is perfect, the policies obtained are provably optimal for any reward function. With imperfect training, the sub-optimality is proportional to the unsupervised approximation error. The FB representation learns long-range relationships between states and actions, via a predictive occupancy map, without having to synthesize states as in model-based approaches. This is a step towards learning controllable agents in arbitrary black-box stochastic environments. This approach compares well to goal-oriented RL algorithms on discrete and continuous mazes, pixel-based MsPacman, and the FetchReach virtual robot arm. We also illustrate how the agent can immediately adapt to new tasks beyond goal-oriented RL.
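To make the test phase concrete, below is a minimal Python sketch of how learned forward and backward maps could be turned into a policy. The names forward_rep, backward_rep, estimate_reward_embedding, and greedy_policy are hypothetical placeholders rather than the paper's API, and the particular choices here (estimating the reward representation as an empirical average of r(s) B(s), then acting greedily on the scores F(s, a, z) . z) are one natural instantiation of "the optimal policy for that reward is directly obtained from these representations, with no planning."

    import numpy as np

    # Minimal sketch of the test phase, assuming the unsupervised phase has produced
    # two learned maps (illustrative names and signatures, not the paper's exact API):
    #   forward_rep(state, action, z) -> F(s, a, z), a d-dimensional vector
    #   backward_rep(state)           -> B(s),       a d-dimensional vector

    def estimate_reward_embedding(reward_samples, backward_rep):
        # Estimate the reward representation z_R from observed (state, reward) pairs,
        # here as an empirical average of r(s) * B(s).
        return np.mean([r * np.asarray(backward_rep(s)) for s, r in reward_samples], axis=0)

    def greedy_policy(state, z_R, forward_rep, actions):
        # Obtain the policy directly from the representations, with no planning:
        # act greedily on the scores F(s, a, z_R) . z_R over a finite action set.
        scores = [float(np.dot(forward_rep(state, a, z_R), z_R)) for a in actions]
        return actions[int(np.argmax(scores))]

For an explicit reward description such as a single target state g, the reward embedding in this sketch would reduce to backward_rep(g), so no reward observations would be needed at test time.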