In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.
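Below is a minimal, hypothetical sketch (in Python/NumPy) of a Thompson sampling policy for Pareto-reward arms, intended only to illustrate the posterior sampling step the abstract refers to. The prior pi(alpha, sigma) proportional to 1/(alpha*sigma), the function names (draw_pareto, sample_posterior_mean, thompson_sampling), and the two-pull initialization are illustrative assumptions; they need not coincide with the probability matching priors or the truncation procedure analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)


def draw_pareto(alpha, sigma):
    # One reward from Pareto(alpha, sigma): sigma * U^(-1/alpha), U ~ Uniform(0, 1].
    return sigma * (1.0 - rng.uniform()) ** (-1.0 / alpha)


def sample_posterior_mean(x):
    # Draw (alpha, sigma) from the posterior given rewards x and return the
    # implied Pareto mean alpha*sigma/(alpha - 1) (infinite when alpha <= 1).
    # Assumes the improper prior pi(alpha, sigma) ~ 1/(alpha*sigma), used here
    # purely for illustration; no truncation procedure is applied.
    x = np.asarray(x)
    n, m = len(x), x.min()
    S = np.log(x / m).sum()                             # sum_i log(x_i / min_j x_j)
    alpha = rng.gamma(shape=n - 1, scale=1.0 / S)       # alpha | data ~ Gamma(n - 1, rate = S)
    sigma = m * rng.uniform() ** (1.0 / (n * alpha))    # sigma | alpha, data, supported on (0, m]
    return alpha * sigma / (alpha - 1.0) if alpha > 1.0 else np.inf


def thompson_sampling(true_params, horizon=10_000):
    # Run TS on arms given as (alpha, sigma) pairs; each arm is pulled twice
    # up front so every posterior is proper before probability matching starts.
    obs = [[draw_pareto(a, s), draw_pareto(a, s)] for a, s in true_params]
    total = sum(sum(o) for o in obs)
    for _ in range(horizon - 2 * len(true_params)):
        k = int(np.argmax([sample_posterior_mean(o) for o in obs]))
        r = draw_pareto(*true_params[k])
        obs[k].append(r)
        total += r
    return total


# Two hypothetical arms with means 3.0 and 2.25; the first arm is optimal.
print(thompson_sampling([(3.0, 2.0), (3.0, 1.5)]))

Sampling a full parameter draw from each arm's posterior, rather than plugging in a point estimate, is what makes this a probability matching rule: each arm is selected with the posterior probability that it has the largest mean reward.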