Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that focuses on learning the MDP dynamics most relevant to obtaining high reward. While this approach improves agent performance by directing learning toward optimizing the reward directly, it does so by learning less accurate dynamics (from an MLE standpoint), and may therefore be brittle to changes in the reward function. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected return while simultaneously remaining robust to changes in the reward function. We demonstrate on a variety of toy examples and healthcare simulators that RDF significantly increases the robustness of DF to changes in the reward function, without decreasing the overall return the agent obtains.