Monte Carlo planners can often return sub-optimal actions, even if they are guaranteed to converge in the limit of infinite samples. Known asymptotic regret bounds do not provide a way to measure confidence in a recommended action at the conclusion of search. In this work, we prove bounds on the sub-optimality of Monte Carlo estimates for non-stationary bandits and Markov decision processes. These bounds can be directly computed at the conclusion of the search and do not require knowledge of the true action-value. The presented bounds hold for general Monte Carlo solvers meeting mild convergence conditions. We empirically test the tightness of the bounds on a multi-armed bandit and a discrete Markov decision process, using both a simple Monte Carlo solver and Monte Carlo tree search.
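To make the setting concrete, the sketch below shows what "a bound computable at the conclusion of search" looks like for a simple Monte Carlo solver on a multi-armed bandit. It uses a standard Hoeffding-style concentration radius as a stand-in for the paper's bound (which is not reproduced here); the solver, arm means, and failure probability `delta` are all illustrative assumptions.

```python
import math
import random

# Illustrative sketch only: a uniform-sampling Monte Carlo solver on a
# Bernoulli multi-armed bandit, followed by a Hoeffding-style sub-optimality
# bound computed at the conclusion of search. The concentration bound here is
# a standard stand-in, NOT the bound proved in the paper.

def monte_carlo_bandit(arm_means, budget, delta=0.05, seed=0):
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k

    # A simple Monte Carlo solver: sample the arms uniformly in round-robin.
    for t in range(budget):
        a = t % k
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += reward

    means = [s / max(1, n) for s, n in zip(sums, counts)]
    best = max(range(k), key=lambda a: means[a])

    # Hoeffding radius for rewards in [0, 1]: with probability >= 1 - delta,
    # every arm's empirical mean is within radius[a] of its true mean.
    radius = [math.sqrt(math.log(2 * k / delta) / (2 * max(1, n)))
              for n in counts]

    # Computable sub-optimality bound on the recommendation: no arm's true
    # value can exceed the recommended arm's true value by more than this,
    # and it uses only quantities available at the end of the search.
    gap_bound = max(
        (means[a] + radius[a]) - (means[best] - radius[best])
        for a in range(k)
    )
    return best, gap_bound

if __name__ == "__main__":
    arm, bound = monte_carlo_bandit([0.3, 0.5, 0.55], budget=3000)
    print(f"recommended arm {arm}, sub-optimality <= {bound:.3f} w.h.p.")
```

Note that the bound depends only on empirical means and sample counts, mirroring the abstract's claim that no knowledge of the true action-values is required.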