The shortcomings of maximum likelihood estimation in the context of model-based reinforcement learning have been highlighted by an increasing number of papers. When the model class is misspecified or has a limited representational capacity, model parameters with high likelihood might not necessarily result in high performance of the agent on a downstream control task. To alleviate this problem, we propose an end-to-end approach for model learning which directly optimizes the expected returns using implicit differentiation. We treat a value function that satisfies the Bellman optimality operator induced by the model as an implicit function of model parameters and show how to differentiate the function. We provide theoretical and empirical evidence highlighting the benefits of our approach in the model misspecification regime compared to likelihood-based methods.
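The core mechanism described above, differentiating through a value function that is defined only implicitly as the fixed point of the model-induced Bellman optimality operator, can be illustrated on a toy tabular MDP. The sketch below is not the authors' code; all names (bellman_op, solve_q, q_star, theta, reward, S, A) are illustrative, and the downstream loss is a stand-in for the expected return the paper optimizes. It uses the implicit function theorem: since Q* = T_theta(Q*), the vector-Jacobian product with respect to the model parameters can be obtained by solving u = g + (dT/dQ)^T u and then backpropagating u through dT/dtheta, without unrolling value iteration.

```python
# Minimal sketch: implicit differentiation through the fixed point of a
# model-induced Bellman optimality operator (toy tabular MDP, JAX).
import jax
import jax.numpy as jnp

S, A = 5, 2      # number of states and actions in the toy MDP
gamma = 0.9      # discount factor

def bellman_op(q, theta, reward):
    """One application of the Bellman optimality operator T_theta."""
    # theta parameterizes the model's transition logits P_theta(s' | s, a).
    p = jax.nn.softmax(theta, axis=-1)              # shape (S, A, S)
    v = jnp.max(q, axis=-1)                         # shape (S,)
    return reward + gamma * jnp.einsum('sap,p->sa', p, v)

def solve_q(theta, reward, n_iters=500):
    """Fixed point of T_theta via plain value iteration (forward pass only)."""
    q = jnp.zeros((S, A))
    for _ in range(n_iters):
        q = bellman_op(q, theta, reward)
    return q

@jax.custom_vjp
def q_star(theta, reward):
    return solve_q(theta, reward)

def q_star_fwd(theta, reward):
    q = solve_q(theta, reward)
    return q, (q, theta, reward)

def q_star_bwd(res, g):
    # Implicit function theorem: Q* = T_theta(Q*) implies
    # dQ*/dtheta = (I - dT/dQ)^{-1} dT/dtheta.
    # Instead of inverting, solve u = g + (dT/dQ)^T u by fixed-point
    # iteration (a contraction for gamma < 1), then push u through dT/dtheta.
    q, theta, reward = res
    _, vjp_q = jax.vjp(lambda q_: bellman_op(q_, theta, reward), q)
    u = g
    for _ in range(500):
        u = g + vjp_q(u)[0]
    _, vjp_params = jax.vjp(lambda th, r: bellman_op(q, th, r), theta, reward)
    return vjp_params(u)

q_star.defvjp(q_star_fwd, q_star_bwd)

def objective(theta, reward):
    # Toy downstream loss defined on the model-induced Q*; in the paper the
    # objective is the agent's expected return in the true environment.
    q = q_star(theta, reward)
    return jnp.max(q[0])

theta = jnp.zeros((S, A, S))
reward = jax.random.normal(jax.random.PRNGKey(0), (S, A))
grads = jax.grad(objective)(theta, reward)
print(grads.shape)  # (5, 2, 5): gradient w.r.t. the model parameters
```

Because the backward pass only needs the converged Q* and vector-Jacobian products of a single Bellman backup, memory does not grow with the number of value-iteration steps, which is the practical appeal of the implicit formulation over unrolled differentiation.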