Humans learn to imitate by observing others. However, robot imitation learning generally requires expert demonstrations in the first-person view (FPV), and collecting such FPV videos for every robot can be very expensive. Third-person imitation learning (TPIL) is the concept of learning action policies by observing other agents in a third-person view (TPV), similar to what humans do. This ultimately allows policy learning to utilize human and robot demonstration videos in TPV from many different data sources. In this paper, we present a TPIL approach for robot tasks with egomotion. Although many robot tasks with ground/aerial mobility often involve actions with camera egomotion, research on TPIL for such tasks has been limited. In these tasks, FPV and TPV observations are visually very different: FPV exhibits egomotion, while the agent's appearance is observable only in TPV. To enable better state learning for TPIL, we propose a disentangled representation learning method. We use a dual auto-encoder structure combined with a representation-permutation loss and a time-contrastive loss to ensure that the state and viewpoint representations are well disentangled. Our experiments confirm the effectiveness of our approach.
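The core mechanism can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: it assumes simple linear encoder/decoder pairs and toy observation vectors, and only shows how a dual auto-encoder latent is split into state and viewpoint parts, how swapping state codes across a synchronized FPV/TPV pair yields a representation-permutation loss, and how a triplet-style time-contrastive loss is computed on the state codes.

```python
# Minimal sketch (NOT the paper's code): dual auto-encoder with a latent
# split into state and viewpoint parts, a representation-permutation loss,
# and a time-contrastive loss. Dimensions and linear maps are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_STATE, D_VIEW = 16, 4, 3   # observation / state / viewpoint dims

# One linear encoder-decoder pair per view (FPV and TPV): the "dual" structure.
enc = {v: rng.normal(size=(D_STATE + D_VIEW, D_OBS)) * 0.1 for v in ("fpv", "tpv")}
dec = {v: rng.normal(size=(D_OBS, D_STATE + D_VIEW)) * 0.1 for v in ("fpv", "tpv")}

def encode(view, x):
    z = enc[view] @ x
    return z[:D_STATE], z[D_STATE:]          # (state code, viewpoint code)

def decode(view, s, w):
    return dec[view] @ np.concatenate([s, w])

def permutation_loss(x_f, x_t):
    """Swap state codes across the FPV/TPV pair, then demand reconstruction.

    Reconstruction must still succeed with the other view's state code,
    which pushes the state representation to be viewpoint-invariant."""
    s_f, w_f = encode("fpv", x_f)
    s_t, w_t = encode("tpv", x_t)
    r_f = decode("fpv", s_t, w_f)            # state swapped, viewpoint kept
    r_t = decode("tpv", s_f, w_t)
    return np.mean((r_f - x_f) ** 2) + np.mean((r_t - x_t) ** 2)

def time_contrastive_loss(x_f, x_t, x_t_other, margin=1.0):
    """Same-time state codes (across views) should be closer than
    different-time ones; a standard triplet hinge loss."""
    s_anchor, _ = encode("fpv", x_f)
    s_pos, _ = encode("tpv", x_t)            # same timestep, other view
    s_neg, _ = encode("tpv", x_t_other)      # different timestep
    d_pos = np.sum((s_anchor - s_pos) ** 2)
    d_neg = np.sum((s_anchor - s_neg) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Fake synchronized FPV/TPV observations at two timesteps.
x_fpv_t0, x_tpv_t0, x_tpv_t1 = (rng.normal(size=D_OBS) for _ in range(3))
total = permutation_loss(x_fpv_t0, x_tpv_t0) \
    + time_contrastive_loss(x_fpv_t0, x_tpv_t0, x_tpv_t1)
print(float(total))
```

In a real training loop, both losses would be minimized jointly over batches of time-aligned FPV/TPV video frames, with convolutional encoders replacing the linear maps above.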