We provide a decision-theoretic analysis of bandit experiments. Working within the framework of diffusion asymptotics, we define suitable notions of asymptotic Bayes and minimax risk for these experiments. For normally distributed rewards, the minimal Bayes risk can be characterized as the solution to a nonlinear second-order partial differential equation (PDE). Using a limit of experiments approach, we show that this PDE characterization also holds asymptotically under both parametric and non-parametric distributions of the rewards. The approach further describes the state variables to which it is asymptotically sufficient to restrict attention, and therefore suggests a practical strategy for dimension reduction. The upshot is that we can approximate the dynamic programming problem defining the bandit experiment with a PDE that can be efficiently solved using sparse matrix routines. We derive the optimal Bayes and minimax policies from the numerical solutions to these PDEs. The proposed policies substantially dominate existing methods such as Thompson sampling. The framework can be generalized to allow for time discounting and pure exploration motives.
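To make the numerical strategy concrete, the following is a minimal, purely illustrative sketch of the kind of computation the abstract describes: a nonlinear second-order PDE for a value function, discretized on a grid and marched backward in time, with the diffusion term handled implicitly through a sparse linear solve. The specific PDE, drifts, terminal payoff, grid sizes, and boundary handling here are hypothetical placeholders and are not the formulation derived in the paper.

```python
# Illustrative sketch (not the paper's specification): solve
#   V_t + max_a { mu_a * V_x } + (1/2) * V_xx = 0,   V(T, x) = g(x),
# backward in time with a semi-implicit finite-difference scheme.
# Diffusion is implicit (sparse tridiagonal solve); the nonlinear
# max-over-arms drift is explicit with upwind differencing.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

nx, T, nt = 400, 1.0, 200                 # illustrative grid sizes
x = np.linspace(-4.0, 4.0, nx)
dx = x[1] - x[0]
dt = T / nt

mu = np.array([-0.5, 0.5])                # hypothetical per-arm drifts
g = lambda z: np.maximum(z, 0.0)          # hypothetical terminal payoff

# Sparse second-difference operator; boundary rows act as crude
# homogeneous Dirichlet conditions, which suffices for a sketch.
L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(nx, nx), format="csc") / dx**2
A = sp.identity(nx, format="csc") - 0.5 * dt * L
solve = spla.factorized(A)                # factor once, reuse each time step

V = g(x)
for _ in range(nt):
    # One-sided differences for the controlled drift term
    fwd = np.zeros(nx); fwd[:-1] = (V[1:] - V[:-1]) / dx
    bwd = np.zeros(nx); bwd[1:] = (V[1:] - V[:-1]) / dx
    # Max over arms, choosing the upwind direction by each drift's sign
    drift = np.max([np.where(m >= 0, m * fwd, m * bwd) for m in mu], axis=0)
    V = solve(V + dt * drift)             # implicit diffusion, explicit drift

# V now approximates the value function at t = 0; an approximately optimal
# policy is read off by checking which arm attains the max at each state.
```

The design choice mirrored here is the one the abstract emphasizes: once the dynamic program is replaced by a low-dimensional PDE, each time step reduces to a sparse linear solve, which is what makes the computation of Bayes and minimax policies tractable.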