Sequential decision-making in cost-sensitive tasks is extremely challenging, especially for problems that have a significant impact on people's daily lives, such as malaria control and treatment recommendation. The main challenge faced by policymakers is to learn a policy from scratch by interacting with a complex environment in only a few trials. This work introduces a practical, data-efficient policy learning method, named Variance-Bonus Monte Carlo Tree Search~(VB-MCTS), which can cope with very little data and enables learning from scratch in only a few trials. Specifically, the solution is a model-based reinforcement learning method. To avoid model bias, we apply Gaussian Process~(GP) regression to estimate the transitions explicitly. With the GP world model, we propose a variance-bonus reward that measures the uncertainty about the world. Adding this reward to MCTS planning leads to more efficient and effective exploration. Furthermore, the derived polynomial sample complexity indicates that VB-MCTS is sample efficient. Finally, outstanding performance in a competitive world-level RL competition and extensive experimental results verify its advantage over the state-of-the-art on the challenging malaria control task.
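To make the core idea concrete, the following is a minimal sketch (not the authors' code) of a GP world model whose predictive standard deviation supplies a variance-bonus reward; the class name GPWorldModel, the weight beta, and the toy dynamics are illustrative assumptions, and in VB-MCTS the bonus would be added to the environment reward inside the tree-search rollouts.

# Illustrative sketch only: a GP transition model with a variance-bonus reward.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class GPWorldModel:
    """Estimates s' = f(s, a) from a small set of observed transitions."""

    def __init__(self, beta=1.0):
        self.beta = beta  # hypothetical weight on the uncertainty bonus
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
        self.gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

    def fit(self, states, actions, next_states):
        # Inputs are (state, action) pairs; targets are next states.
        X = np.hstack([states, actions])
        self.gp.fit(X, next_states)

    def predict(self, state, action):
        """Return predicted next state and a variance-bonus term."""
        x = np.hstack([state, action]).reshape(1, -1)
        mean, std = self.gp.predict(x, return_std=True)
        bonus = self.beta * float(np.mean(std))  # uncertainty about the world
        return mean.ravel(), bonus

# Toy usage with fabricated dynamics, purely for illustration.
rng = np.random.default_rng(0)
S, A = rng.normal(size=(20, 2)), rng.uniform(size=(20, 1))
S_next = S + 0.1 * A
model = GPWorldModel(beta=0.5)
model.fit(S, A, S_next)
next_state, bonus = model.predict(S[0], A[0])
augmented_reward = 0.0 + bonus  # env reward + variance bonus used by the planner

The design choice mirrored here is that the bonus shrinks as the GP becomes confident about a region of the state-action space, so the planner naturally shifts from exploration to exploitation as data accumulates.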