Traditionally, reinforcement learning methods predict the next action based on the current state. However, in many situations, directly applying actions to control systems or robots is dangerous and may lead to unexpected behaviors, because actions are low-level. In this paper, we propose a novel hierarchical reinforcement learning framework without explicit actions. Our meta policy predicts the next optimal state, and the actual action is produced by an inverse dynamics model. To stabilize the training process, we integrate adversarial learning and an information bottleneck into our framework. Under our framework, widely available state-only demonstrations can be exploited effectively for imitation learning. In addition, prior knowledge and constraints can be applied to the meta policy. We evaluate our algorithm on simulated tasks, as well as in combination with imitation learning. The experimental results demonstrate the reliability and robustness of our algorithms.
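Below is a minimal sketch of the two-level decomposition described above: a meta policy proposes the desired next state, and an inverse dynamics model maps (current state, desired next state) to a low-level action. This sketch assumes PyTorch; the class names, network sizes, and the act helper are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MetaPolicy(nn.Module):
    """Predicts the desired next state from the current state."""

    def __init__(self, state_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state):
        return self.net(state)


class InverseDynamics(nn.Module):
    """Recovers the action that moves the system from state to next_state."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, next_state):
        return self.net(torch.cat([state, next_state], dim=-1))


def act(state, meta_policy, inverse_dynamics):
    """Hierarchical action selection: choose a state goal first, then a low-level action."""
    desired_next_state = meta_policy(state)
    action = inverse_dynamics(state, desired_next_state)
    return action
```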