Feedback in Imitation Learning: The Three Regimes of Covariate Shift

Imitation learning practitioners have often noted that conditioning policies on previous actions leads to a dramatic divergence between "held-out" error and the performance of the learner in situ. Interactive approaches can provably address this divergence but require repeated querying of a demonstrator. Recent work identifies this divergence as stemming from a "causal confound" in predicting the current action, and seeks to ablate causal aspects of the current state using tools from causal inference. In this work, we argue instead that this divergence is simply another manifestation of covariate shift, exacerbated particularly in settings with feedback between decisions and input features. The learner often comes to rely on features that are strongly predictive of decisions but are themselves subject to strong covariate shift. Our work demonstrates a broad class of problems where this shift can be mitigated, both theoretically and practically, by taking advantage of a simulator but without any further querying of expert demonstration. We analyze existing benchmarks used to test imitation learning approaches and find that these benchmarks are realizable and simple, and thus insufficient for capturing the harder regimes of error compounding seen in real-world decision-making problems. We find, in surprising contrast with previous literature but consistent with our theory, that naive behavioral cloning provides excellent results. We detail the need for new standardized benchmarks that capture the phenomena seen in robotics problems.
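The gap between held-out error and in-situ performance can be illustrated with a toy sketch. It is not the paper's method or experimental setup: the expert, the block-switching signal, and every function name below are hypothetical, chosen only to show how a feature that predicts the expert well (the previous action) can fail once the learner's own outputs are fed back as inputs.

```python
from collections import Counter, defaultdict

def expert_actions(horizon, block):
    # Hypothetical expert: the action follows a signal that flips every `block` steps.
    return [(t // block) % 2 for t in range(horizon)]

def fit_prev_action_policy(actions):
    # Tabular behavioral cloning on a single feature: the previous action.
    counts = defaultdict(Counter)
    for prev, cur in zip(actions, actions[1:]):
        counts[prev][cur] += 1
    # For each previous action, predict the most frequent next action.
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

def held_out_error(policy, actions):
    # The previous action is read from the expert's own trajectory.
    wrong = sum(policy[prev] != cur for prev, cur in zip(actions, actions[1:]))
    return wrong / (len(actions) - 1)

def rollout_error(policy, actions):
    # The previous action is the learner's own prediction, fed back as input.
    prev, wrong = actions[0], 0
    for cur in actions[1:]:
        pred = policy[prev]
        wrong += pred != cur
        prev = pred
    return wrong / (len(actions) - 1)

train = expert_actions(horizon=200, block=20)
test = expert_actions(horizon=300, block=30)
policy = fit_prev_action_policy(train)
print("held-out error:", held_out_error(policy, test))
print("rollout error:", rollout_error(policy, test))
```

In this sketch the fitted policy simply repeats the previous action. On expert data it is wrong only at the rare switch points, while in rollout it never switches at all, so its error is far higher. The learner here has no access to the underlying signal, so the example shows only the feedback effect and not the mitigation the abstract describes.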
