Adversarial imitation learning has become a widely used imitation learning framework. The discriminator is typically trained by treating expert demonstrations and policy trajectories as examples from two classes (positive vs. negative), and the policy is then optimized to produce trajectories that are indistinguishable from the expert demonstrations. In the real world, however, collected expert demonstrations are likely to be imperfect: only an unknown fraction of them are optimal. Instead of treating imperfect expert demonstrations as absolutely positive or negative, we handle them as what they are, namely unlabeled data. We develop a positive-unlabeled adversarial imitation learning algorithm that dynamically samples expert demonstrations matching the trajectories of the continually optimized agent policy. The trajectories of an initial agent policy may be closer to the non-optimal expert demonstrations, but within the adversarial imitation learning framework the policy is optimized to fool the discriminator and thus gradually produces trajectories resembling the optimal expert demonstrations. Theoretical analysis shows that our method learns from imperfect demonstrations in a self-paced manner. Experimental results on the MuJoCo and RoboSuite platforms demonstrate the effectiveness of our method from different aspects.
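To make the positive-unlabeled treatment of the discriminator concrete, below is a minimal sketch in PyTorch, not the paper's exact objective: it applies a non-negative PU-style risk estimator to the discriminator, treating agent trajectories as labeled negatives and the imperfect expert demonstrations as an unlabeled positive/negative mixture. The `Discriminator` class, the `pu_discriminator_loss` function, and the class prior `eta` (the assumed non-optimal fraction among the demonstrations) are hypothetical names introduced here for illustration.

```python
# Sketch of a PU-style discriminator loss for adversarial imitation learning.
# Assumptions (illustrative only): agent trajectories are labeled negatives,
# expert demonstrations are unlabeled, and the non-optimal portion of the
# demonstrations is assumed to be distributed like the agent data.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Binary discriminator over (state, action) pairs, returning a logit."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # raw logit; sigmoid is applied inside the loss
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def pu_discriminator_loss(
    disc: Discriminator,
    expert_obs: torch.Tensor, expert_act: torch.Tensor,  # unlabeled demonstrations
    agent_obs: torch.Tensor, agent_act: torch.Tensor,    # labeled negatives (policy rollouts)
    eta: float = 0.3,  # assumed prior of non-optimal (agent-like) data in the demos
) -> torch.Tensor:
    """Non-negative PU-style risk (in the spirit of Kiryo et al., 2017)."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_obs, expert_act)
    agent_logits = disc(agent_obs, agent_act)

    pos_targets_expert = torch.ones_like(expert_logits)
    pos_targets_agent = torch.ones_like(agent_logits)
    neg_targets_agent = torch.zeros_like(agent_logits)

    # Risk of the negative component, estimated from labeled agent data.
    neg_risk = eta * bce(agent_logits, neg_targets_agent)
    # Unbiased estimate of the positive risk: unlabeled demos treated as positive,
    # with the negative contribution subtracted via the class prior.
    pos_risk = bce(expert_logits, pos_targets_expert) - eta * bce(agent_logits, pos_targets_agent)
    # Clamp to keep the corrected estimate non-negative and training stable.
    return neg_risk + torch.clamp(pos_risk, min=0.0)
```

In such a sketch, the discriminator is not forced to label every demonstration as positive; demonstrations that look agent-like contribute less positive risk, which is one way the PU view can accommodate an unknown fraction of non-optimal demonstrations.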