We study the statistical limits of Imitation Learning (IL) in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$. We focus on the known-transition setting, where the learner is given a dataset of $N$ length-$H$ trajectories from a deterministic expert policy and knows the MDP transitions. We establish an upper bound of $O(|\mathcal{S}|H^{3/2}/N)$ on the suboptimality using the Mimic-MD algorithm of Rajaraman et al. (2020), which we prove to be computationally efficient. In contrast, we show that the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$, while the unknown-transition setting suffers from a larger sharp rate of $\Theta(|\mathcal{S}|H^2/N)$ (Rajaraman et al., 2020). The lower bound is established by proving a two-way reduction between IL and the problem of estimating the value of the unknown expert policy under any given reward function, and by building connections with linear functional estimation with subsampled observations. We further show that, under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we term Mimic-Mixture, that provably achieves suboptimality $O(1/N)$ for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, no algorithm can achieve suboptimality $O(\sqrt{H}/N)$ with high probability if the expert is not constrained to be optimal. Our work formally establishes the benefit of the expert-optimality assumption in the known-transition setting, whereas Rajaraman et al. (2020) showed that it does not help when transitions are unknown.