In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain predefined reward set. In this paper, we study GAIL in both online and offline settings with linear function approximation, where both the transition and the reward function are linear in the feature maps. Besides the expert demonstration, in the online setting the agent can interact with the environment, while in the offline setting the agent only accesses an additional dataset collected a priori. For online GAIL, we propose an optimistic generative adversarial policy optimization algorithm (OGAP) and prove that OGAP achieves $\widetilde{\mathcal{O}}(H^2 d^{3/2}K^{1/2}+KH^{3/2}dN_1^{-1/2})$ regret. Here $N_1$ represents the number of trajectories in the expert demonstration, $d$ is the feature dimension, $H$ is the horizon, and $K$ is the number of episodes. For offline GAIL, we propose a pessimistic generative adversarial policy optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain the optimality gap of PGAP, which achieves the minimax lower bound in the utilization of the additional dataset. Assuming sufficient coverage of the additional dataset, we show that PGAP achieves an $\widetilde{\mathcal{O}}(H^{2}dK^{-1/2}+H^2d^{3/2}N_2^{-1/2}+H^{3/2}dN_1^{-1/2})$ optimality gap. Here $N_2$ represents the number of trajectories of the additional dataset with sufficient coverage.
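For concreteness, the linear function approximation condition above is typically formalized as a linear MDP; the following is a sketch of that standard assumption, where the feature map $\phi$ and the parameters $\theta_h$, $\mu_h$ are illustrative notation and not necessarily the paper's own symbols. For every step $h$ and state-action pair $(s,a)$,
$$
r_h(s,a) = \langle \phi(s,a),\, \theta_h \rangle, \qquad \mathbb{P}_h(\cdot \mid s,a) = \langle \phi(s,a),\, \mu_h(\cdot) \rangle,
$$
with $\phi(s,a), \theta_h \in \mathbb{R}^d$ and $\mu_h$ a vector of $d$ (signed) measures over states, so that both the reward and the transition kernel are linear in the known $d$-dimensional feature map.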