In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), where the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror-descent-based no-regret algorithms: one for policy optimization and another for learning the worst-case cost. To this end, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$ is the number of interactions with the MDP, and an additional linear error term that depends on the number of expert trajectories available. Importantly, our algorithm avoids the need to solve an MDP at each iteration, making it more practical than prior AL methods. Finally, we implement a deep variant of our algorithm which shares some similarities with GAIL \cite{ho2016generative}, but where the discriminator is replaced with the costs learned by the OAL problem. Our simulations demonstrate that our theoretically grounded approach outperforms the baselines.
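The following is a minimal tabular sketch of the two-player mirror-descent idea described above, not the paper's exact algorithm: it assumes a small finite MDP with known transitions, a linear cost class $c_w(s,a)=w_{s,a}$ with $w$ constrained to a bounded box, and exact policy evaluation. The cost player runs projected (Euclidean) mirror descent toward costs that separate agent and expert occupancy measures, while the policy player runs KL mirror descent (an exponentiated-gradient update over accumulated Q-values). All names (`S`, `A`, `expert_pi`, step sizes, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, iters = 5, 3, 0.9, 200
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] -> next-state distribution
rho0 = np.ones(S) / S                        # initial-state distribution

def occupancy(pi):
    """Discounted state-action occupancy measure of a stochastic policy pi."""
    P_pi = np.einsum("sap,sa->sp", P, pi)    # state transition matrix under pi
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * rho0)
    return d_s[:, None] * pi                 # d(s, a) = d(s) * pi(a | s)

# "Expert" occupancy measure, here induced by an arbitrary fixed policy.
expert_pi = rng.dirichlet(np.ones(A), size=S)
d_expert = occupancy(expert_pi)

pi = np.ones((S, A)) / A                     # uniform initial policy
w = np.zeros((S, A))                         # cost parameters, kept in [0, 1]
eta_pi, eta_w = 0.5, 0.5                     # mirror-descent step sizes (assumed)
Q_sum = np.zeros((S, A))                     # accumulated Q-values for the policy player

for k in range(iters):
    d_pi = occupancy(pi)
    # Cost player: projected gradient ascent on <w, d_pi - d_expert>,
    # i.e. mirror descent with a Euclidean regularizer over the box [0, 1].
    w = np.clip(w + eta_w * (d_pi - d_expert), 0.0, 1.0)
    # Policy player: KL mirror descent, i.e. a softmax over accumulated
    # Q-values of the current cost (exact policy evaluation of pi under w).
    P_pi = np.einsum("sap,sa->sp", P, pi)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * w).sum(axis=1))
    Q = w + gamma * P @ V                    # Q[s, a] under cost w
    Q_sum += Q
    logits = -eta_pi * Q_sum
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print("occupancy gap:", np.abs(occupancy(pi) - d_expert).sum())
```

In this sketch the occupancy gap shrinks as the two no-regret updates play against each other; the paper's setting additionally handles unknown transitions, finitely many expert trajectories (the linear error term above), and a deep-learning variant in place of the tabular updates.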