More Efficient Adversarial Imitation Learning Algorithms With Known and Unknown Transitions

In this work, we design provably (more) efficient imitation learning algorithms that directly optimize policies from expert demonstrations. First, when the transition function is known, we build on the nearly minimax optimal algorithm MIMIC-MD and relax a projection operator in it. Based on this change, we develop an adversarial imitation learning (AIL) algorithm named \emph{TAIL} with a gradient-based optimization procedure. TAIL has the same sample complexity (i.e., the number of expert trajectories) as MIMIC-MD, namely $\widetilde{\mathcal{O}}(H^{3/2} |\mathcal{S}|/\varepsilon)$, where $H$ is the planning horizon, $|\mathcal{S}|$ is the state space size, and $\varepsilon$ is the desired policy value gap. In addition, TAIL is more practical than MIMIC-MD: the former has a space complexity of $\mathcal{O} (|\mathcal{S}||\mathcal{A}|H)$, while the latter's is about $\mathcal{O} (|\mathcal{S}|^2 |\mathcal{A}|^2 H^2)$, where $|\mathcal{A}|$ is the action space size. Second, for the scenario where the transition function is unknown but interaction with the environment is allowed, we present an extension of TAIL named \emph{MB-TAIL}. The sample complexity of MB-TAIL is still $\widetilde{\mathcal{O}}(H^{3/2} |\mathcal{S}|/\varepsilon)$, while its interaction complexity (i.e., the number of interaction episodes) is $\widetilde{\mathcal{O}} (H^3 |\mathcal{S}|^2 |\mathcal{A}| / \varepsilon^2)$. In particular, MB-TAIL improves significantly on the best-known OAL algorithm, which has a sample complexity of $\widetilde{\mathcal{O}}(H^{2} |\mathcal{S}|/\varepsilon^2)$ and an interaction complexity of $\widetilde{\mathcal{O}} (H^4 |\mathcal{S}|^2 |\mathcal{A}| / \varepsilon^2)$. The advances in MB-TAIL are based on a new framework that connects reward-free exploration and AIL. To the best of our knowledge, MB-TAIL is the first algorithm that carries the advances made in the known-transition setting over to the unknown-transition setting.
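To make the stated bounds easier to compare, the following is a minimal Python sketch that evaluates the leading terms of each complexity for a given problem size. It is only an illustration: it assumes all hidden constants are 1 and drops the logarithmic factors that $\widetilde{\mathcal{O}}$ suppresses, so the absolute numbers are not meaningful and only the ratios between algorithms are informative. The class `ProblemSize` and all function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSize:
    """Hypothetical container for the quantities that appear in the bounds."""
    horizon: int        # H, the planning horizon
    num_states: int     # |S|, the state space size
    num_actions: int    # |A|, the action space size
    epsilon: float      # desired policy value gap


# Leading terms only: constants are assumed to be 1 and log factors are dropped.

def tail_sample_complexity(p: ProblemSize) -> float:
    """H^{3/2} |S| / eps, stated for TAIL, MIMIC-MD and MB-TAIL."""
    return p.horizon ** 1.5 * p.num_states / p.epsilon


def oal_sample_complexity(p: ProblemSize) -> float:
    """H^2 |S| / eps^2, stated for OAL."""
    return p.horizon ** 2 * p.num_states / p.epsilon ** 2


def mbtail_interaction_complexity(p: ProblemSize) -> float:
    """H^3 |S|^2 |A| / eps^2, stated for MB-TAIL."""
    return p.horizon ** 3 * p.num_states ** 2 * p.num_actions / p.epsilon ** 2


def oal_interaction_complexity(p: ProblemSize) -> float:
    """H^4 |S|^2 |A| / eps^2, stated for OAL."""
    return p.horizon ** 4 * p.num_states ** 2 * p.num_actions / p.epsilon ** 2


def tail_space_complexity(p: ProblemSize) -> int:
    """|S| |A| H, stated for TAIL."""
    return p.num_states * p.num_actions * p.horizon


def mimic_md_space_complexity(p: ProblemSize) -> int:
    """|S|^2 |A|^2 H^2, stated (approximately) for MIMIC-MD."""
    return (p.num_states * p.num_actions * p.horizon) ** 2


if __name__ == "__main__":
    # Arbitrary illustrative problem size, not taken from the paper.
    p = ProblemSize(horizon=10, num_states=20, num_actions=5, epsilon=0.1)
    print("sample complexity ratio OAL / MB-TAIL:",
          oal_sample_complexity(p) / tail_sample_complexity(p))
    print("interaction complexity ratio OAL / MB-TAIL:",
          oal_interaction_complexity(p) / mbtail_interaction_complexity(p))
    print("space complexity ratio MIMIC-MD / TAIL:",
          mimic_md_space_complexity(p) / tail_space_complexity(p))
```

Under these simplifications the ratios follow directly from the formulas: the sample complexity gap between OAL and MB-TAIL is $\sqrt{H}/\varepsilon$, the interaction complexity gap is $H$, and the space complexity gap between MIMIC-MD and TAIL is $|\mathcal{S}||\mathcal{A}|H$.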
