We consider large-scale Markov decision processes with an unknown cost function and address the problem of learning a policy from a finite set of expert demonstrations. We assume that the learner is not allowed to interact with the expert and has no access to a reinforcement signal of any kind. Existing inverse reinforcement learning methods come with strong theoretical guarantees but are computationally expensive, while state-of-the-art policy optimization algorithms achieve significant empirical success but are hampered by limited theoretical understanding. To bridge the gap between theory and practice, we introduce a novel bilinear saddle-point framework using Lagrangian duality. The proposed primal-dual viewpoint allows us to develop a model-free, provably efficient algorithm through the lens of stochastic convex optimization. The method enjoys simplicity of implementation, low memory requirements, and computational and sample complexities independent of the number of states. We further present an equivalent no-regret online-learning interpretation.
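For concreteness, the kind of saddle-point structure alluded to above can be illustrated as follows (a minimal sketch based on the standard occupancy-measure, linear-programming view of discounted MDPs; the notation $\mu$, $\mu_E$, $\nu_0$, $E$, $P$, $V$ is our own illustrative choice and is not taken from the text). Matching the expert over feasible occupancy measures can be written as
\[
\min_{\mu \ge 0}\; \max_{c \in \mathcal{C}}\; \langle c,\, \mu - \mu_E \rangle
\quad \text{s.t.} \quad (E - \gamma P)^\top \mu = (1-\gamma)\,\nu_0,
\]
where $\mu$ is the learner's state-action occupancy measure, $\mu_E$ the expert's empirical occupancy measure, $\nu_0$ the initial-state distribution, and $E$, $P$ the marginalization and transition operators. Dualizing the Bellman flow constraints with a multiplier $V$ (interpretable as a value function) yields the bilinear Lagrangian saddle point
\[
\min_{\mu \ge 0}\; \max_{c \in \mathcal{C},\, V}\;
\langle c,\, \mu - \mu_E \rangle \;+\; \big\langle V,\, (1-\gamma)\,\nu_0 - (E - \gamma P)^\top \mu \big\rangle,
\]
which is amenable to stochastic primal-dual updates whose per-iteration cost need not depend on the size of the state space.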