We introduce a simple and efficient algorithm for stochastic linear bandits with finitely many actions that is asymptotically optimal and (nearly) worst-case optimal in finite time. The approach is based on the frequentist information-directed sampling (IDS) framework, with a surrogate for the information gain that is informed by the optimization problem that defines the asymptotic lower bound. Our analysis sheds light on how IDS balances the trade-off between regret and information and uncovers a surprising connection between the recently proposed primal-dual methods and the IDS algorithm. We demonstrate empirically that IDS is competitive with UCB in finite time, and can be significantly better in the asymptotic regime.
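For context (not reproduced from the abstract), the frequentist IDS rule referenced here selects, at each round $t$, a sampling distribution over the finite action set that minimizes the ratio of squared estimated regret to a measure of information gain. A generic sketch in standard IDS notation, rather than the paper's specific surrogate, is
\[
\pi_t \in \operatorname*{arg\,min}_{\pi \in \mathcal{P}(\mathcal{A})}
\frac{\Big(\sum_{a \in \mathcal{A}} \pi(a)\, \hat{\Delta}_t(a)\Big)^2}{\sum_{a \in \mathcal{A}} \pi(a)\, I_t(a)},
\]
where $\hat{\Delta}_t(a)$ is an estimate of the instantaneous regret of action $a$ and $I_t(a)$ is a (surrogate) information gain; the abstract's claim is that choosing $I_t$ based on the optimization problem defining the asymptotic lower bound yields both asymptotic optimality and near worst-case optimality.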