In the reinforcement learning literature, many algorithms are developed for either Contextual Bandit (CB) or Markov Decision Process (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision-making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption about the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is a CB or an MDP, and effectively interpolates between classical CB and MDP-based algorithms to mitigate the effects of misspecifying the environment. We perform simulations and demonstrate that in CB settings our algorithm achieves lower regret than MDP-based algorithms, while in non-bandit MDP settings our algorithm is able to learn the optimal policy, often achieving regret comparable to that of MDP-based algorithms.
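The abstract does not spell out the hypothesis test itself, so the following is only a minimal illustrative sketch of the general idea: maintain a posterior over the hypothesis "next states depend on (state, action)" (MDP) versus "next states are drawn from a single shared distribution" (CB), updated online from observed transitions via Dirichlet-multinomial marginal likelihoods. All names and parameters here (`EnvironmentTypePosterior`, `alpha`, `prior_mdp`) are assumptions for illustration, not the paper's actual method.

```python
import numpy as np
from scipy.special import gammaln


def log_dirichlet_multinomial(counts, alpha):
    """Log marginal likelihood of categorical counts under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (gammaln(k * alpha) - gammaln(k * alpha + counts.sum())
            + np.sum(gammaln(counts + alpha)) - k * gammaln(alpha))


class EnvironmentTypePosterior:
    """Hypothetical online Bayesian test: do actions influence the next state (MDP) or not (CB)?

    Under H_CB the next state is drawn from one distribution shared by all
    (state, action) pairs; under H_MDP each (state, action) pair has its own
    next-state distribution. Both hypotheses use symmetric Dirichlet priors.
    """

    def __init__(self, n_states, n_actions, prior_mdp=0.5, alpha=1.0):
        self.alpha = alpha
        self.log_prior_mdp = np.log(prior_mdp)
        self.log_prior_cb = np.log(1.0 - prior_mdp)
        # Next-state counts per (state, action) pair, and pooled over all pairs.
        self.counts_sa = np.zeros((n_states, n_actions, n_states))
        self.counts_pooled = np.zeros(n_states)

    def update(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        self.counts_sa[s, a, s_next] += 1
        self.counts_pooled[s_next] += 1

    def posterior_mdp(self):
        """Posterior probability that the environment is an MDP rather than a CB."""
        n_states, n_actions, _ = self.counts_sa.shape
        log_ml_mdp = sum(log_dirichlet_multinomial(self.counts_sa[s, a], self.alpha)
                         for s in range(n_states) for a in range(n_actions))
        log_ml_cb = log_dirichlet_multinomial(self.counts_pooled, self.alpha)
        log_post = np.array([self.log_prior_cb + log_ml_cb,
                             self.log_prior_mdp + log_ml_mdp])
        log_post -= log_post.max()  # stabilize before exponentiating
        post = np.exp(log_post)
        return post[1] / post.sum()
```

In such a scheme, `prior_mdp` is where a practitioner's prior knowledge about the environment would enter, and the resulting posterior could be used to weight between a CB-style and an MDP-style learner; how the paper actually performs this interpolation is specified in the main text, not here.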