We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (\SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of \SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.
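For intuition only, here is a minimal hedged sketch of what a saddle-point imitation objective of this flavor can look like; it is our own illustration of the general technique, not necessarily the exact objective optimized by \SPOIL. The critic class $\mathcal{F}$, the expert state-action distribution $\mu_E$, and the notation below are assumptions introduced purely for this example.

% Illustrative sketch only (assumed notation, not the paper's exact loss):
% \mu_E is the expert's state-action distribution and \mathcal{F} is a
% critic class, e.g., functions linear in known features in the linear
% Q^\pi-realizable case.
\[
  \min_{\pi} \; \max_{f \in \mathcal{F}} \;
  \mathbb{E}_{(s,a) \sim \mu_E}
  \Big[ f(s,a) - \mathbb{E}_{a' \sim \pi(\cdot \mid s)}\big[ f(s,a') \big] \Big].
\]

In a sketch of this kind, the critic searches for state-action pairs where the learner's action distribution falls short of the expert's, while the policy minimizes that gap; driving the inner maximum close to zero over the realizable critic class is what certifies expert-matching performance up to the stated additive error.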