This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges to a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.
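As a schematic illustration of the kind of limit described above (with generic notation that need not match the paper's), consider a $k$-armed Bayesian bandit with horizon $n$, posterior state $\pi$, and one-step posterior update $\pi^{(a)}$ after pulling arm $a$. The Bayes-optimal value function satisfies the dynamic programming recursion
\[
V_t(\pi) \;=\; \max_{a \in \{1,\dots,k\}} \mathbb{E}\!\left[\, r_a + V_{t+1}\big(\pi^{(a)}\big) \,\middle|\, \pi \right],
\qquad V_n(\pi) = 0 .
\]
Under a rescaling of time $s = t/n$ and a suitable normalization $v(s,\pi) \approx V_{\lfloor sn \rfloor}(\pi)/n$, one formally expects a first-order limit of the form
\[
\partial_s v(s,\pi) \;+\; \max_{a} \big\{ \mu_a(\pi) + \mathcal{L}_a v(s,\pi) \big\} \;=\; 0,
\qquad v(1,\pi) = 0,
\]
where $\mu_a(\pi)$ is the posterior mean reward of arm $a$ and $\mathcal{L}_a$ is a generator capturing how the posterior evolves when that arm is pulled; the exact rescaling and generators depend on the bandit model. An approximate Bayes-optimal policy is then obtained by acting greedily with respect to the maximizer inside the limiting HJB equation.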