We introduce a simple and efficient algorithm for stochastic linear bandits with finitely many actions that is asymptotically optimal and worst-case rate optimal in finite time. The approach is based on the frequentist information-directed sampling (IDS) framework, with a surrogate for the information gain that is informed by the optimization problem that defines the asymptotic lower bound. Our analysis sheds light on how IDS balances the trade-off between regret and information. Moreover, we uncover a surprising connection between the recently proposed primal-dual methods and the Bayesian IDS algorithm. We demonstrate empirically that IDS is competitive with UCB in finite time, and can be significantly better in the asymptotic regime.
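To make the regret-information trade-off concrete, the following is a minimal sketch of a generic deterministic IDS selection rule: pick the action minimizing the ratio of squared estimated regret to information gain. This is an illustrative simplification, not the paper's algorithm; the frequentist variant described above uses a surrogate information gain derived from the asymptotic lower-bound optimization problem and may randomize over actions.

```python
import math

def ids_action(gap_estimates, info_gains):
    """Generic (deterministic) IDS rule, for illustration only.

    gap_estimates: estimated regret (suboptimality gap) per action.
    info_gains: information-gain measure per action; the choice of
        this measure is the key design decision in IDS.
    Returns the index minimizing gap^2 / info (the information ratio);
    actions with zero information gain are never selected unless all are.
    """
    ratios = [
        (g * g / i) if i > 0 else math.inf
        for g, i in zip(gap_estimates, info_gains)
    ]
    return ratios.index(min(ratios))

# Action 1 has a moderate gap but high information gain,
# so it wins the information-ratio comparison.
print(ids_action([0.5, 0.4, 0.1], [0.1, 0.8, 0.01]))  # → 1
```

The ratio formalizes the balance the abstract refers to: an action is worth playing either because its estimated regret is small or because playing it is very informative about the optimal action.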