In Offline Model Learning for Planning and in Offline Reinforcement Learning, the limited data set hinders the estimation of the value function of the underlying Markov Decision Process (MDP). Consequently, the real-world performance of the resulting policy is limited and possibly risky, especially when deploying a wrong policy can lead to catastrophic consequences. For this reason, several approaches are being pursued with the aim of reducing the model error (or the distributional shift between the learned model and the true one) and, more broadly, of obtaining risk-aware solutions with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? In an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty following the Bayesian formalism, and (2) selects the policy that maximizes a risk-aware objective over the Bayesian posterior among a fixed set of candidate policies provided, for instance, by the current baselines. We validate EvC against state-of-the-art approaches in different discrete, yet simple, environments covering a fair variety of MDP classes. In the tested scenarios EvC manages to select robust policies and hence stands out as a useful tool for practitioners who aim to apply offline planning and reinforcement learning solvers in the real world.
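To make the selection step described above concrete, the following is a minimal sketch, not the paper's implementation: it assumes a tabular MDP with a Dirichlet posterior over transition probabilities fitted on observed counts, and uses the alpha-quantile of the initial-state value across posterior samples as one possible risk-aware objective. All function names, parameters, and the choice of posterior and criterion are illustrative assumptions.

```python
import numpy as np

def sample_posterior_mdps(counts, n_samples, rng):
    """Sample transition tensors P[s, a, s'] from a Dirichlet posterior
    fitted on observed transition counts (hypothetical uniform prior of 1)."""
    S, A, _ = counts.shape
    samples = np.empty((n_samples, S, A, S))
    for s in range(S):
        for a in range(A):
            samples[:, s, a, :] = rng.dirichlet(counts[s, a] + 1.0, size=n_samples)
    return samples

def policy_value(P, R, policy, gamma=0.95):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} R_pi."""
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]   # (S, S) transitions under the policy
    R_pi = R[np.arange(S), policy]   # (S,) expected rewards under the policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def select_policy_risk_aware(counts, R, candidate_policies, alpha=0.1,
                             n_samples=200, gamma=0.95, seed=0):
    """Pick the candidate policy whose initial-state value is best under a
    pessimistic criterion (alpha-quantile over posterior MDP samples)."""
    rng = np.random.default_rng(seed)
    mdps = sample_posterior_mdps(counts, n_samples, rng)
    scores = []
    for pi in candidate_policies:
        values = [policy_value(P, R, pi, gamma)[0] for P in mdps]
        scores.append(np.quantile(values, alpha))  # risk-aware score
    best = int(np.argmax(scores))
    return candidate_policies[best], scores
```

In this sketch the candidate policies (e.g., the outputs of different offline baselines) are fixed in advance; the posterior over models, not the policies themselves, carries the uncertainty, and the pessimistic quantile trades expected return for robustness to model error.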