Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from the environment are sparse or even disregarded entirely. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model this multimodality and stochasticity. We treat the environmental state-action transition as a conditional generative process that generates the next-state prediction conditioned on the current state, the action, and a latent variable, which provides a better understanding of the dynamics and leads to better performance in exploration. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment-model-based exploration approaches.
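As a concrete illustration of the bound mentioned above, the following is a minimal sketch using the standard conditional variational formulation; the notation ($s_t$, $a_t$, $s_{t+1}$, latent $z$, approximate posterior $q_\phi$, decoder and prior $p_\theta$) is ours and the exact form may differ from the paper's derivation:

$$
-\log p_\theta(s_{t+1}\mid s_t,a_t)\;\le\;
\mathbb{E}_{q_\phi(z\mid s_t,a_t,s_{t+1})}\!\big[-\log p_\theta(s_{t+1}\mid s_t,a_t,z)\big]
+ D_{\mathrm{KL}}\!\big(q_\phi(z\mid s_t,a_t,s_{t+1})\,\big\|\,p_\theta(z\mid s_t,a_t)\big).
$$

Under this formulation, the intrinsic reward at step $t$ can be taken as the value of this upper bound, so transitions the dynamic model finds hard to explain (large reconstruction error or large posterior-prior mismatch) are rewarded during exploration.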