Behavioral cloning has proven to be effective for learning sequential decision-making policies from expert demonstrations. However, behavioral cloning often suffers from the causal confusion problem, where a policy relies on the noticeable effects of expert actions, which are strongly correlated with those actions but are not the causes we want it to learn. This paper presents Object-aware REgularizatiOn (OREO), a simple technique that regularizes an imitation policy in an object-aware manner. Our main idea is to encourage a policy to uniformly attend to all semantic objects, in order to prevent the policy from exploiting nuisance variables strongly correlated with expert actions. To this end, we introduce a two-stage approach: (a) we extract semantic objects from images by utilizing discrete codes from a vector-quantized variational autoencoder, and (b) we randomly drop the units that share the same discrete code together, i.e., masking out semantic objects. Our experiments demonstrate that OREO significantly improves the performance of behavioral cloning, outperforming various other regularization and causality-based methods on a variety of Atari environments and a self-driving CARLA environment. We also show that our method even outperforms inverse reinforcement learning methods trained with a considerable amount of environment interaction.
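To make the second stage concrete, the following is a minimal sketch of the object-aware dropout idea: units that share the same discrete code are dropped together, so whole semantic objects are masked out at once. It assumes a PyTorch-style feature map `features` of shape (B, C, H, W) and per-position VQ-VAE code indices `codes` of shape (B, H, W); the function name, shapes, and rescaling choice are illustrative assumptions, not the authors' reference implementation.

```python
import torch


def object_aware_dropout(features: torch.Tensor,
                         codes: torch.Tensor,
                         num_codes: int,
                         drop_prob: float = 0.5) -> torch.Tensor:
    """Randomly drop all spatial units that share the same discrete code.

    features: (B, C, H, W) encoder features fed to the imitation policy.
    codes:    (B, H, W) discrete VQ-VAE code index of each spatial position.
    """
    b = features.shape[0]
    # Per example, decide for each of the num_codes codes whether to keep (1) or drop (0) it.
    keep_code = (torch.rand(b, num_codes, device=features.device) > drop_prob).float()
    # Look up the keep/drop decision of every spatial position via its code index.
    keep_mask = torch.gather(keep_code, 1, codes.view(b, -1)).view(b, 1, *codes.shape[1:])
    # Mask whole "objects" at once and rescale, in the style of inverted dropout.
    return features * keep_mask / (1.0 - drop_prob)
```

In this sketch the mask would be applied only during behavioral-cloning training, with the unmasked features used at evaluation time, mirroring how standard dropout is used as a regularizer.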