Visual Adversarial Imitation Learning using Variational Models

Reward function specification, which requires considerable human effort and iteration, remains a major impediment to learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors is often an easier and more natural way to teach agents. We consider a setting where an agent is given a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. To address these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https://sites.google.com/view/variational-mail.
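To make the "lack of a fixed reward" point concrete, the following is a minimal sketch of the generic adversarial imitation idea: a discriminator is trained to tell demonstration features from policy features, and its output serves as a learned reward. It is not the V-MAIL implementation. The abstract does not specify the discriminator architecture or reward form, so the logistic discriminator, the `log D` reward, the synthetic Gaussian features, and all function names here are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score(z, w, b):
    # Linear logit of a hypothetical discriminator D(z) = sigmoid(w . z + b).
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def train_discriminator(expert, policy, steps=200, lr=0.5):
    """Fit D by gradient descent on cross-entropy: expert -> 1, policy -> 0."""
    data = [(z, 1.0) for z in expert] + [(z, 0.0) for z in policy]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(steps):
        gw, gb = [0.0] * dim, 0.0
        for z, y in data:
            err = sigmoid(score(z, w, b)) - y
            for i in range(dim):
                gw[i] += err * z[i]
            gb += err
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def imitation_reward(z, w, b):
    # Assumed reward form log D(z): larger when z resembles demonstration data.
    return math.log(sigmoid(score(z, w, b)) + 1e-8)


# Synthetic stand-ins for learned feature vectors (assumption: 4-dim Gaussians).
rng = random.Random(0)
expert_z = [[rng.gauss(1.0, 0.5) for _ in range(4)] for _ in range(64)]
policy_z = [[rng.gauss(-1.0, 0.5) for _ in range(4)] for _ in range(64)]

w, b = train_discriminator(expert_z, policy_z)
mean_expert = sum(imitation_reward(z, w, b) for z in expert_z) / len(expert_z)
mean_policy = sum(imitation_reward(z, w, b) for z in policy_z) / len(policy_z)
print(mean_expert > mean_policy)
```

Because the discriminator is retrained as the policy changes, the reward it defines keeps moving, which is one way to read the instability the abstract mentions.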
