The global objective of Inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories generated by (approximately) optimal policies. The classical approach consists of tuning this cost function so that the associated optimal trajectories (those that minimise the cumulative discounted cost, i.e. the classical RL loss) are 'similar' to the observed ones. Prior contributions focused on penalising degenerate solutions and improving algorithmic scalability. Quite orthogonally to them, we question the pertinence of characterising optimality with respect to the cumulative discounted cost, as it induces an implicit bias against policies with longer mixing times. State-of-the-art value-based RL algorithms circumvent this issue by solving for the fixed point of the Bellman optimality operator, a stronger criterion that is not well defined for the inverse problem. To alleviate this bias in IRL, we introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem. The algorithms we devised exhibit enhanced performance (and similar tractability) compared to off-the-shelf ones in multiple OpenAI Gym environments.
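For reference, the two optimality criteria contrasted above can be written in standard MDP notation (a sketch using conventional symbols; the exact reweighted loss proposed here is not reproduced in this summary). The classical RL loss is the cumulative discounted cost

\[
J_\gamma(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right], \qquad 0 < \gamma < 1,
\]

while value-based methods instead solve for the fixed point of the Bellman optimality operator,

\[
V^{*}(s) \;=\; \min_{a}\left( c(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[ V^{*}(s') \right] \right).
\]

Since the weight \(\gamma^{t}\) decays geometrically in \(t\), the loss \(J_\gamma\) down-weights states visited late along a trajectory, which is the source of the implicit bias against slow-mixing policies that motivates the reweighted loss.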