Agents that can learn to imitate from video observations alone, \emph{without direct access to state or action information}, are more applicable to learning in the natural world. However, formulating a reinforcement learning (RL) agent that facilitates this goal remains a significant challenge. We approach this challenge using contrastive training to learn a reward function that compares an agent's behaviour with a single demonstration. We use a Siamese recurrent neural network architecture to learn a distance in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we also find that the inclusion of multi-task data and additional image encoding losses improves the temporal consistency of the learned rewards and, as a result, significantly improves policy learning. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and on a quadruped and a humanoid in 3D. We show that our method outperforms current state-of-the-art techniques in these environments and can learn to imitate from a single video demonstration.
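The distance-based reward described above can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation; the module structure, layer sizes, and the `imitation_reward` helper are hypothetical choices made for the example, not the authors' code.

```python
# Sketch: a Siamese recurrent encoder maps a video clip to an embedding, and
# the imitation reward is the negative distance between the agent's clip
# embedding and the demonstration's. Architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseRecurrentEncoder(nn.Module):
    """Shared encoder: per-frame CNN followed by a GRU over the clip."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim), nn.ReLU(),
        )
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t = clip.shape[:2]
        frames = self.frame_encoder(clip.flatten(0, 1)).view(b, t, -1)
        _, hidden = self.rnn(frames)
        return F.normalize(hidden[-1], dim=-1)  # one embedding per clip


def imitation_reward(encoder: SiameseRecurrentEncoder,
                     agent_clip: torch.Tensor,
                     demo_clip: torch.Tensor) -> torch.Tensor:
    """Reward is the negative embedding distance between agent and demo clips."""
    with torch.no_grad():
        z_agent = encoder(agent_clip)
        z_demo = encoder(demo_clip)
    return -torch.norm(z_agent - z_demo, dim=-1)
```

In this sketch the encoder would be trained contrastively on motion clips, and the RL policy would receive `imitation_reward` as its learning signal, so that maximizing reward corresponds to minimizing the learned distance to the demonstration.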