Given a dataset of expert agent interactions with an environment of interest, a viable method for extracting an effective agent policy is to estimate the maximum likelihood policy indicated by this data. This approach is commonly referred to as behavioral cloning (BC). In this work, we describe a key disadvantage of BC that arises from its maximum likelihood objective: BC is mean-seeking with respect to the state-conditional expert action distribution when the learner's policy is represented as a Gaussian. To address this issue, we introduce a modified version of BC, Adversarial Behavioral Cloning (ABC), that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training. We evaluate ABC on toy domains and on a domain based on Hopper from the DeepMind Control Suite, and show that its mode-seeking behavior allows it to outperform standard BC.
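To make the mean-seeking claim concrete, the following is a minimal sketch (not from the paper; the bimodal action data and numbers are hypothetical) showing that fitting a single Gaussian by maximum likelihood to a bimodal expert action distribution places the policy mean between the two modes, i.e. on an action no expert actually took, whereas a mode-seeking learner would commit to one of the modes.

```python
import numpy as np

# Hypothetical expert data: at a single state, experts take one of two
# distinct actions (e.g. swerve left at -1 or swerve right at +1).
rng = np.random.default_rng(0)
expert_actions = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),   # mode near -1
    rng.normal(+1.0, 0.05, size=500),   # mode near +1
])

# BC with a Gaussian policy: the maximum likelihood estimate of the mean
# is the sample mean, which falls between the modes (mean-seeking).
mle_mean = expert_actions.mean()
print(f"Gaussian MLE mean: {mle_mean:.3f}")  # close to 0

# For contrast, a mode-seeking learner would concentrate on one mode;
# here we simply report the two empirical mode locations.
left_mode = expert_actions[expert_actions < 0].mean()
right_mode = expert_actions[expert_actions >= 0].mean()
print(f"Expert modes near: {left_mode:.3f} and {right_mode:.3f}")
```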