Recently, various successful applications utilizing expert states in imitation learning (IL) have been witnessed. However, another IL setting -- IL from visual inputs (ILfVI), which holds greater promise for real-world application by utilizing online visual resources, suffers from low data-efficiency and poor performance resulting from its on-policy learning manner and high-dimensional visual inputs. We propose OPIfVI (Off-Policy Imitation from Visual Inputs), which is composed of an off-policy learning manner, data augmentation, and encoder techniques, to tackle these challenges, respectively. More specifically, to improve data-efficiency, OPIfVI conducts IL in an off-policy manner, so that sampled data can be used multiple times. In addition, we enhance the stability of OPIfVI with spectral normalization to mitigate the side effects of off-policy training. We attribute the poor performance of ILfVI primarily to the agent's inability to extract meaningful features from visual inputs. Hence, OPIfVI employs data augmentation from computer vision to help train encoders that can better extract features from visual inputs. Furthermore, a specific gradient-backpropagation structure for the encoder is designed to stabilize its training. Finally, extensive experiments on the DeepMind Control Suite demonstrate that OPIfVI achieves expert-level performance and outperforms existing baselines, whether visual demonstrations or visual observations are provided.
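To make the three mechanisms named above concrete, the following is a minimal PyTorch sketch of (i) image augmentation of the pad-and-random-crop kind common in pixel-based RL, (ii) spectral normalization applied to network layers for off-policy stability, and (iii) a gradient-flow structure in which only one loss updates the shared encoder while the actor sees detached features. All module names, network sizes, and the choice of which gradients reach the encoder are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only; shapes and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_shift(imgs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Pad-and-random-crop augmentation for image observations."""
    n, _, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    top = torch.randint(0, 2 * pad + 1, (n,))
    left = torch.randint(0, 2 * pad + 1, (n,))
    return torch.stack(
        [padded[i, :, top[i]:top[i] + h, left[i]:left[i] + w] for i in range(n)]
    )

class Encoder(nn.Module):
    """Convolutional encoder mapping stacked frames to a feature vector."""
    def __init__(self, in_channels: int = 9, feature_dim: int = 50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(feature_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.conv(obs / 255.0)
        return self.fc(h.flatten(start_dim=1))

def sn_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Module:
    """MLP whose linear layers are wrapped in spectral normalization,
    the stabilizer the abstract applies against off-policy side effects."""
    return nn.Sequential(
        nn.utils.spectral_norm(nn.Linear(in_dim, hidden)), nn.ReLU(),
        nn.utils.spectral_norm(nn.Linear(hidden, out_dim)),
    )

encoder = Encoder()
critic = sn_mlp(50 + 6, 1)   # Q(s, a); action dim 6 is an assumption
actor = sn_mlp(50, 6)        # deterministic head, kept small for brevity

obs = torch.randint(0, 256, (8, 9, 84, 84), dtype=torch.float32)
act = torch.randn(8, 6)

feat = encoder(random_shift(obs))               # augmented obs -> features
q = critic(torch.cat([feat, act], dim=-1))      # this loss path updates the encoder
pi = actor(feat.detach())                       # actor gradients stop at the encoder
```

Routing encoder updates through a single loss while detaching features elsewhere is one common way to stabilize encoder training in off-policy visual RL; whether it matches OPIfVI's exact backpropagation scheme is an assumption here.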