We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward, and propose an adjustment of standard imitation learning algorithms to fit this setup. We then discuss the problem of distribution shift between the expert data and the online environment when the data is only partially observable. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.