In this work, we address risk-averse Bayes-adaptive reinforcement learning. We pose the problem of optimising the conditional value at risk (CVaR) of the total return in Bayes-adaptive Markov decision processes (MDPs). We show that a policy optimising CVaR in this setting is risk-averse to both the parametric uncertainty due to the prior distribution over MDPs and the internal uncertainty due to the inherent stochasticity of MDPs. We reformulate the problem as a two-player stochastic game and propose an approximate algorithm based on Monte Carlo tree search and Bayesian optimisation. Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem.
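For reference, the standard lower-tail definition of CVaR for a return variable $Z$ is the textbook Rockafellar–Uryasev form (stated here for concreteness; the notation is not taken from this abstract): for a confidence level $\alpha \in (0, 1]$,

\[
\mathrm{CVaR}_\alpha(Z) \;=\; \sup_{c \in \mathbb{R}} \Big\{\, c - \tfrac{1}{\alpha}\,\mathbb{E}\big[(c - Z)^{+}\big] \,\Big\},
\]

which, for a continuous return distribution, equals $\mathbb{E}[Z \mid Z \le \mathrm{VaR}_\alpha(Z)]$: the expected return over the worst $\alpha$-fraction of outcomes. In the Bayes-adaptive setting, the distribution of $Z$ mixes the prior over MDPs with the stochasticity of each MDP, which is why optimising this tail is averse to both sources of uncertainty.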
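As an illustrative aside (not the algorithm proposed here), the sketch below estimates CVaR of the return by Monte Carlo: each sample first draws an MDP from a prior and then a stochastic return from that MDP, so both parametric and internal uncertainty enter the tail. All names and distributions are hypothetical placeholders.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Estimate lower-tail CVaR_alpha from return samples.

    Uses the mean of the worst ceil(alpha * n) samples, a standard
    (slightly biased) estimator of E[Z | Z <= VaR_alpha(Z)].
    """
    returns = np.sort(np.asarray(returns))          # ascending: worst outcomes first
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the alpha-tail
    return returns[:k].mean()

# Hypothetical generative model for illustration only:
rng = np.random.default_rng(0)
mdp_means = rng.normal(1.0, 0.5, size=10_000)  # parametric uncertainty: MDP drawn from prior
returns = rng.normal(mdp_means, 1.0)           # internal uncertainty: stochastic return per MDP

print(f"mean return: {returns.mean():.3f}")
print(f"CVaR_0.1:    {empirical_cvar(returns, alpha=0.1):.3f}")
```

The gap between the mean and the CVaR estimate in this toy example is exactly the tail risk that a risk-neutral (expected-return) policy ignores and a CVaR-optimising policy trades expected return to reduce.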