We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution belongs to a one-dimensional exponential family, which covers many common reward distributions such as Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields a finite-time regret bound and an asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong problem-dependent criterion for the finite-time regret), minimax optimal up to a factor of $\sqrt{\log K}$, and asymptotically optimal. Moreover, we propose ExpTS$^+$, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS to avoid over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm and achieves minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple, and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
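For context, the sketch below shows standard Thompson sampling on a $K$-armed Bernoulli bandit, a special case of the one-dimensional exponential family; it illustrates only the generic posterior-sampling loop, not the novel ExpTS/ExpTS$^+$ sampling distribution proposed in this paper, and the arm means and horizon are illustrative assumptions.

\begin{verbatim}
# Minimal sketch: vanilla Thompson sampling for a Bernoulli bandit
# (Beta(1, 1) priors). Not the ExpTS sampling distribution.
import numpy as np

def thompson_sampling_bernoulli(reward_probs, T, seed=0):
    """Run standard TS for T rounds and return the cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    K = len(reward_probs)
    successes = np.ones(K)   # Beta posterior alpha parameters
    failures = np.ones(K)    # Beta posterior beta parameters
    best = max(reward_probs)
    regret = 0.0
    for _ in range(T):
        # Sample a mean estimate for each arm from its posterior.
        theta = rng.beta(successes, failures)
        arm = int(np.argmax(theta))
        # Observe a Bernoulli reward and update the posterior of the pulled arm.
        reward = float(rng.random() < reward_probs[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - reward_probs[arm]
    return regret

# Example: 3-armed bandit with means (0.3, 0.5, 0.6), horizon 10,000.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.6], 10_000))
\end{verbatim}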