Various methods for solving the inverse reinforcement learning (IRL) problem have been developed independently in machine learning and economics. In particular, the method of Maximum Causal Entropy IRL is based on the perspective of entropy maximization, while related advances in economics instead assume the existence of unobserved action shocks to explain expert behavior (Nested Fixed Point Algorithm, Conditional Choice Probability method, Nested Pseudo-Likelihood Algorithm). In this work, we establish previously unknown connections between these related methods from both fields. We achieve this by showing that they all belong to a class of optimization problems characterized by a common form of the objective, the associated policy, and the objective gradient. We demonstrate key computational and algorithmic differences that arise between the methods due to an approximation of the optimal soft value function, and describe how this leads to more efficient algorithms. Using insights that emerge from our study of this class of optimization problems, we identify various problem scenarios and investigate each method's suitability for them.
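For concreteness, a minimal sketch of the quantities referenced above (soft value function, associated policy, and objective gradient), assuming the standard discounted-MDP formulation of Maximum Causal Entropy IRL with a reward that is linear in features f(s,a) with weights \theta; the notation here is illustrative and not fixed by the abstract:
\[
Q_{\theta}(s,a) = \theta^{\top} f(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ V_{\theta}(s') \right],
\qquad
V_{\theta}(s) = \log \sum_{a} \exp Q_{\theta}(s,a),
\]
\[
\pi_{\theta}(a \mid s) = \exp\!\left( Q_{\theta}(s,a) - V_{\theta}(s) \right),
\qquad
\nabla_{\theta} \mathcal{L}(\theta) = \mathbb{E}_{\mathrm{expert}}\!\Big[ \textstyle\sum_{t} f(s_t,a_t) \Big] - \mathbb{E}_{\pi_{\theta}}\!\Big[ \textstyle\sum_{t} f(s_t,a_t) \Big].
\]
In this sketch, the approximation mentioned in the abstract corresponds to replacing the exact log-sum-exp fixed point defining V_{\theta} with a cheaper estimate, which is where the computational and algorithmic differences between the methods arise.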