We present SoftDICE, which achieves state-of-the-art performance for imitation learning. SoftDICE fixes several key problems in ValueDICE, an off-policy distribution-matching approach for sample-efficient imitation learning. First, the objective of ValueDICE contains logarithms and exponentials of expectations, for which the mini-batch gradient estimate is always biased. Second, when expert demonstrations are limited, ValueDICE regularizes its objective with replay-buffer samples, which changes the original distribution-matching problem. Third, the re-parametrization trick used to derive the off-policy objective relies on an implicit assumption that rarely holds in training. To address these issues, we leverage a novel formulation of distribution matching and consider an entropy-regularized off-policy objective, which yields a completely offline algorithm called SoftDICE. Our empirical results show that SoftDICE recovers the expert policy with only a single demonstration trajectory and no further on-policy or off-policy samples. SoftDICE also consistently outperforms ValueDICE and other baselines in terms of sample efficiency on MuJoCo benchmark tasks.
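To make the first issue concrete, consider a generic objective of the form assumed here for illustration (not the exact ValueDICE objective), $J = \log \mathbb{E}_{x\sim p}\!\left[e^{f(x)}\right]$. Its natural mini-batch estimate on samples $x_1,\dots,x_N \sim p$ is
$$\hat{J} = \log\!\Big(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} e^{f(x_i)}\Big),$$
and by Jensen's inequality $\mathbb{E}[\hat{J}] \le \log \mathbb{E}_{x\sim p}\!\left[e^{f(x)}\right] = J$, with equality only in degenerate cases. Because the nonlinearity of the logarithm does not commute with the sample average, both the value and the gradient of this estimator are biased for any finite batch size, which sketches the bias problem the abstract attributes to objectives containing logarithms and exponentials of expectations.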