Thompson sampling (TS) for parametric stochastic multi-armed bandits has been well studied under one-dimensional parametric models. It is often reported that TS is fairly insensitive to the choice of the prior when it comes to regret bounds. However, this property does not necessarily hold when multiparameter models are considered, e.g., a Gaussian model with unknown mean and variance parameters. In this paper, we first extend the regret analysis of TS to the model of uniform distributions with unknown supports. Specifically, we show that switching between noninformative priors drastically affects the expected regret. Through our analysis, the uniform prior is proven to be the optimal choice in terms of the expected regret, while the reference prior and the Jeffreys prior are found to be suboptimal, which is consistent with previous findings for the model of Gaussian distributions. However, the uniform prior is specific to the parameterization of the distributions, meaning that if an agent considers different parameterizations of the same model, the agent with the uniform prior might not always achieve the optimal performance. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which achieves asymptotic optimality for the Gaussian distributions and the uniform distributions by using the reference prior and the Jeffreys prior, both of which are invariant under one-to-one reparameterizations. The pre-processing of the posterior distribution is the key to TS-T, where we add an adaptive truncation procedure on the parameter space of the posterior distributions. Simulation results support our analysis, where TS-T shows the best performance in a finite-time horizon compared to other known optimal policies, while TS with the invariant priors performs poorly.
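To make the truncation idea concrete, the sketch below shows a toy Thompson-sampling loop for Gaussian arms with unknown mean and variance, where each posterior-style draw is clipped to an adaptive interval around the empirical mean. This is only an illustrative assumption of how a truncation pre-processing step can look; it is not the paper's exact TS-T specification, prior, or truncation rule, and all function names here are hypothetical.

```python
import math
import random

def sample_truncated_index(n, mean, sq_sum, t):
    """Draw a posterior-style index for one arm, then clip it to an
    adaptive interval around the empirical mean (toy truncation step)."""
    # Use a unit-variance fallback until at least two observations exist.
    var_hat = sq_sum / n if n >= 2 else 1.0
    theta = random.gauss(mean, math.sqrt(var_hat / n))
    # Adaptive truncation: confine the draw to a log(t)-wide interval.
    width = math.sqrt(2.0 * var_hat * math.log(max(t, 2)) / n)
    return min(max(theta, mean - width), mean + width)

def run(arms, horizon, seed=0):
    """Play `horizon` rounds on Gaussian arms given as (mu, sigma) pairs;
    return total reward and per-arm pull counts."""
    random.seed(seed)
    k = len(arms)
    n = [0] * k
    mean = [0.0] * k
    sq = [0.0] * k          # sum of squared deviations per arm (Welford)
    total = 0.0
    for t in range(horizon):
        if t < k:           # pull each arm once for initialization
            i = t
        else:
            i = max(range(k),
                    key=lambda j: sample_truncated_index(n[j], mean[j], sq[j], t))
        mu, sigma = arms[i]
        r = random.gauss(mu, sigma)
        # Welford-style incremental update of mean and squared deviations.
        n[i] += 1
        d = r - mean[i]
        mean[i] += d / n[i]
        sq[i] += d * (r - mean[i])
        total += r
    return total, n

# Example: two Gaussian arms; the policy should pull the high-mean arm more.
total, counts = run([(0.0, 1.0), (2.0, 1.0)], horizon=2000)
```

The truncation width here shrinks as an arm is pulled more often, so poorly explored arms keep optimistic-width intervals while well-sampled arms concentrate; the actual TS-T analysis makes this trade-off precise.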