Exploitation vs Caution: Risk-sensitive Policies for Offline Learning

Offline model learning for planning is a branch of machine learning that trains agents to act in an unknown environment using a fixed batch of previously collected experiences. The limited size of the data set hinders the estimation of the value function of the associated Markov Decision Process (MDP), which bounds the performance of the resulting policy in the real world. In this context, recent works showed that planning with a discount factor lower than the one used during the evaluation phase yields better-performing policies. However, the optimal discount factor is ultimately chosen by cross-validation. Our aim is to show that looking for a sub-optimal solution of a Bayesian MDP might lead to better performance than the current baselines that work in the offline setting. Hence, we propose Exploitation vs Caution (EvC), an algorithm that automatically selects, from a set of policies obtained by solving several MDPs characterized by different discount factors and transition dynamics, the policy that solves a Risk-sensitive Bayesian MDP. On the one hand, the Bayesian formalism elegantly includes model uncertainty; on the other hand, the introduction of a risk-sensitive utility function guarantees robustness. We evaluated the proposed approach in several simple discrete environments offering a fair variety of MDP classes. We also compared the results with state-of-the-art offline learning for planning baselines such as MOPO and MOReL. In the tested scenarios EvC is more robust than these approaches, suggesting that sub-optimally solving an Offline Risk-sensitive Bayesian MDP (ORBMDP) could define a sound framework for planning under model uncertainty.
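The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: a tabular MDP with known rewards, a Dirichlet posterior over transitions built from visit counts with a uniform prior, and the lower-tail mean (CVaR) as the risk-sensitive criterion, since the abstract does not specify the utility function. The names `value_iteration`, `policy_value`, `cvar`, and `select_policy` are hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=500):
    """Greedy policy for transitions P[s, a, s'] and rewards R[s, a]."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)          # shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def policy_value(P, R, policy, gamma, start):
    """Exact value of a deterministic policy at the start state."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]                # (S, S) transitions under the policy
    R_pi = R[idx, policy]                # (S,) rewards under the policy
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
    return V[start]

def cvar(values, alpha):
    """Mean of the worst alpha-fraction of the values (assumed risk measure)."""
    values = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(values))))
    return values[:k].mean()

def select_policy(counts, R, gammas, gamma_eval, start=0,
                  n_models=30, alpha=0.2, seed=0):
    """Pick, among policies planned with several discounts and sampled
    dynamics, the one with the best risk-sensitive score under the posterior."""
    rng = np.random.default_rng(seed)
    S, A, _ = counts.shape

    def sample_model():
        # Assumption: Dirichlet posterior with a uniform prior (counts + 1).
        return np.array([[rng.dirichlet(counts[s, a] + 1.0)
                          for a in range(A)] for s in range(S)])

    plan_models = [sample_model() for _ in range(n_models)]
    eval_models = [sample_model() for _ in range(n_models)]

    # Candidate set: one policy per (sampled model, planning discount) pair.
    candidates = sorted({tuple(int(a) for a in value_iteration(P, R, g))
                         for P in plan_models for g in gammas})

    def score(pi):
        pi = np.array(pi)
        returns = [policy_value(P, R, pi, gamma_eval, start)
                   for P in eval_models]
        return cvar(returns, alpha)

    scores = {pi: score(pi) for pi in candidates}
    best = max(candidates, key=scores.get)
    return np.array(best), scores[best]
```

Here `counts[s, a, s']` holds the number of observed transitions in the offline batch, `gammas` are the planning discount factors, and `gamma_eval` is the discount used for evaluation. A smaller `alpha` makes the selection more cautious, because only the worst sampled models contribute to the score.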
