Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

In this work, we take a representation learning perspective on hierarchical reinforcement learning, in which the problem of learning the lower layers of a hierarchy is transformed into the problem of learning trajectory-level generative models. We show that we can learn continuous latent representations of trajectories that are effective in solving temporally extended and multi-stage problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders. A key component of the method is learning both a latent-conditioned policy and a latent-conditioned model that are consistent with each other: given the same latent, the trajectory generated by the policy should match the trajectory predicted by the model. The model therefore acts as a built-in prediction mechanism for the outcome of closed-loop policy behavior. We propose a novel algorithm for performing hierarchical RL with this model, combining model-based planning in the learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards on several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.
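To make the self-consistency idea concrete, the following is a minimal toy sketch in Python with NumPy. It is not the SeCTAR implementation. Everything in it is an assumption chosen for illustration: the point-mass dynamics in `env_step`, the linear policy and linear state decoder, the hand-built weights from `make_consistent_pair`, and the random-shooting planner in `plan_latent`, which stands in for whatever planning procedure the method uses in its latent space. The sketch shows only how a consistency loss compares the decoder's predicted states with the states the policy actually visits for the same latent, and how a planner can then search over latents using the decoder alone.

```python
import numpy as np

# Hypothetical toy sizes: 2-D point-mass state, 2-D latent, 5-step horizon.
STATE_DIM, LATENT_DIM, HORIZON = 2, 2, 5


def env_step(state, action):
    """Toy dynamics (an assumption): the action is a small displacement."""
    return state + 0.1 * action


def policy(state, z, W_pi):
    """Latent-conditioned policy: a linear map from [state, z] to an action."""
    return W_pi @ np.concatenate([state, z])


def policy_rollout(s0, z, W_pi):
    """Run the policy closed loop in the environment; return visited states."""
    states, s = [], s0
    for _ in range(HORIZON):
        s = env_step(s, policy(s, z, W_pi))
        states.append(s)
    return np.stack(states)  # shape (HORIZON, STATE_DIM)


def state_decoder(s0, z, W_dec):
    """Latent-conditioned model: predicts the full state sequence from (s0, z)."""
    flat = W_dec @ np.concatenate([s0, z])
    return flat.reshape(HORIZON, STATE_DIM)


def consistency_loss(s0, z, W_pi, W_dec):
    """Mean squared gap between predicted states and states the policy visits."""
    predicted = state_decoder(s0, z, W_dec)
    visited = policy_rollout(s0, z, W_pi)
    return float(np.mean((predicted - visited) ** 2))


def make_consistent_pair():
    """Hand-built weights (illustrative only) for which the loss is zero.

    The policy outputs action = z, so the state at step t is s0 + 0.1 * t * z,
    and the decoder is constructed to predict exactly that sequence.
    """
    W_pi = np.hstack([np.zeros((STATE_DIM, STATE_DIM)), np.eye(LATENT_DIM)])
    W_dec = np.vstack([
        np.hstack([np.eye(STATE_DIM), 0.1 * (t + 1) * np.eye(LATENT_DIM)])
        for t in range(HORIZON)
    ])
    return W_pi, W_dec


def plan_latent(s0, goal, W_dec, n_candidates=256, seed=0):
    """Random-shooting planner: sample latents from a standard normal and keep
    the one whose predicted final state is closest to the goal."""
    rng = np.random.default_rng(seed)
    candidates = rng.standard_normal((n_candidates, LATENT_DIM))
    costs = [np.linalg.norm(state_decoder(s0, z, W_dec)[-1] - goal)
             for z in candidates]
    return candidates[int(np.argmin(costs))]


if __name__ == "__main__":
    W_pi, W_dec = make_consistent_pair()
    s0 = np.array([0.5, -0.25])
    goal = s0 + np.array([0.5, -0.5])
    z_best = plan_latent(s0, goal, W_dec)
    print("chosen latent:", z_best)
    print("consistency loss:", consistency_loss(s0, z_best, W_pi, W_dec))
    print("final state reached by policy:", policy_rollout(s0, z_best, W_pi)[-1])
```

In this toy setting the planner never queries the environment: it scores candidate latents with the decoder only, and the chosen latent is then executed by the policy. That is only trustworthy when the consistency loss is small, which is the reason the two components are trained to agree. In a learned system both sets of weights would be trained rather than constructed by hand.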
