We provide a decision-theoretic analysis of bandit experiments. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for these experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a second-order partial differential equation (PDE). Using a limit-of-experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric reward distributions. The approach further identifies the state variables to which it is asymptotically sufficient to restrict attention, thereby suggesting a practical strategy for dimension reduction. The PDEs characterizing minimal Bayes risk can be solved efficiently using sparse matrix routines, and we derive the optimal Bayes and minimax policies from their numerical solutions. These optimal policies substantially dominate existing methods such as Thompson sampling and UCB, often by a factor of two. The framework also covers time discounting and pure exploration.
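As an illustration of the kind of computation the abstract alludes to, the sketch below solves a generic second-order PDE by an implicit finite-difference scheme with sparse matrix routines. The equation used here (a one-dimensional diffusion equation with a terminal payoff) is a stand-in chosen for simplicity, not the paper's actual Bayes-risk PDE; the grid sizes, payoff, and boundary treatment are all illustrative assumptions.

```python
# Hedged sketch: implicit finite-difference solve of a second-order PDE
# using sparse matrices. We integrate V_tau = 0.5 * V_xx (tau = time to go)
# from the terminal condition V(x) = max(x, 0) -- a placeholder payoff,
# not the Bayes-risk PDE from the paper.
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import splu

nx, nt = 201, 200                      # illustrative grid sizes
x = np.linspace(-3.0, 3.0, nx)
dx = x[1] - x[0]
dt = 1.0 / nt

V = np.maximum(x, 0.0)                 # terminal condition (placeholder)

# Sparse second-difference operator; zero Dirichlet values at the
# truncated boundaries (acceptable for this illustration).
lap = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(nx, nx)) / dx**2
A = identity(nx) - 0.5 * dt * lap      # one implicit Euler step

lu = splu(A.tocsc())                   # factorize once, reuse every step
for _ in range(nt):
    V = lu.solve(V)                    # V now holds the value at tau = 1

print(V[nx // 2])                      # value at x = 0
```

Factorizing the sparse operator once with `splu` and reusing it across time steps is what makes schemes like this cheap: each step is a pair of triangular solves rather than a fresh linear solve.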