In this paper, we propose a Thompson Sampling algorithm for \emph{unimodal} bandits, in which the expected reward is unimodal over the partially ordered arms. To better exploit the unimodal structure, at each step, instead of exploring the entire decision space, our algorithm samples from the posterior distribution only within the neighborhood of the arm with the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm matches the lower bound for unimodal bandits; hence it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than that of standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and real-world applications.
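To make the sampling rule concrete, the following is a minimal sketch of the mechanism the abstract describes: Thompson Sampling with Beta posteriors, restricted to the current empirical leader and its neighbors. It is an illustrative reconstruction under stated assumptions, not the paper's exact pseudocode; the names `neighbors`, `pull`, and `horizon` are hypothetical, and the initialization round and prior choice are assumptions.

```python
import numpy as np

def unimodal_ts_bernoulli(neighbors, pull, horizon, rng=None):
    """Sketch of neighborhood-restricted Thompson Sampling for Bernoulli
    unimodal bandits (illustrative, not the paper's exact algorithm).

    neighbors[k] -- list of arms adjacent to arm k in the partial order
    pull(k)      -- environment callback returning a 0/1 reward for arm k
    """
    rng = rng or np.random.default_rng()
    K = len(neighbors)
    successes = np.zeros(K)  # Beta(1, 1) prior for each arm (assumed)
    failures = np.zeros(K)
    counts = np.zeros(K)
    for t in range(horizon):
        if t < K:
            # Assumed initialization: pull each arm once.
            candidates = [t]
        else:
            # Leader = arm with the highest empirical mean estimate.
            means = successes / np.maximum(counts, 1)
            leader = int(np.argmax(means))
            candidates = [leader] + list(neighbors[leader])
        # Thompson sample only within the leader's neighborhood,
        # rather than over the entire decision space.
        samples = {k: rng.beta(successes[k] + 1, failures[k] + 1)
                   for k in candidates}
        arm = max(samples, key=samples.get)
        r = pull(arm)
        successes[arm] += r
        failures[arm] += 1 - r
        counts[arm] += 1
```

The key design point is the restriction of the posterior sampling step to the leader's neighborhood: under unimodality, the optimal arm is always reachable through neighbors with higher means, so exploration outside this local set is unnecessary.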