In the contextual linear bandit setting, algorithms built on the optimism principle fail to exploit the structure of the problem and have been shown to be asymptotically suboptimal. In this paper, we follow the recent approach of deriving asymptotically optimal algorithms from problem-dependent regret lower bounds, and we introduce a novel algorithm improving over the state of the art along multiple dimensions. We build on a reformulation of the lower bound in which the context distribution and the exploration policy are decoupled, and we obtain an algorithm robust to unbalanced context distributions. Then, using an incremental primal-dual approach to solve the Lagrangian relaxation of the lower bound, we obtain a scalable and computationally efficient algorithm. Finally, we remove forced exploration and instead build on confidence intervals of the optimization problem to encourage a minimum level of exploration better adapted to the problem structure. We demonstrate the asymptotic optimality of our algorithm, while providing both problem-dependent and worst-case finite-time regret guarantees. Our bounds scale with the logarithm of the number of arms, thus avoiding the linear dependence common to all related prior works. Notably, we establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits. Empirically, we verify that our algorithm outperforms state-of-the-art baselines.
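As a purely illustrative sketch of the kind of optimization problem referred to above (the notation is assumed here, not taken from the paper: $\rho$ the context distribution, $\eta$ an exploration allocation over arms per context, $\phi(x,a)$ the features, $\Delta(x,a)$ the gaps, $a^\star(x)$ the optimal arm in context $x$, and $\lambda$ the multipliers), a Graves--Lai-style problem-dependent lower bound can be stated as a constrained allocation problem; the paper's exact reformulation and normalization may differ:
\[
\min_{\eta \ge 0}\; \sum_{x} \rho(x) \sum_{a} \eta(a \mid x)\,\Delta(x,a)
\quad\text{s.t.}\quad
\big\|\phi(x,a^\star(x)) - \phi(x,a)\big\|^2_{V(\rho,\eta)^{-1}} \le \frac{\Delta(x,a)^2}{2}
\quad \forall x,\; \forall a \neq a^\star(x),
\]
where $V(\rho,\eta) = \sum_{x}\rho(x)\sum_{a}\eta(a\mid x)\,\phi(x,a)\phi(x,a)^\top$ is the induced design matrix and the optimal value (times $\log n$) lower-bounds the asymptotic regret. Introducing multipliers $\lambda(x,a)\ge 0$ yields the Lagrangian relaxation
\[
\mathcal{L}(\eta,\lambda) \;=\; \sum_{x,a}\rho(x)\,\eta(a\mid x)\,\Delta(x,a)
\;+\; \sum_{x,\,a\neq a^\star(x)} \lambda(x,a)\Big(\big\|\phi(x,a^\star(x))-\phi(x,a)\big\|^2_{V(\rho,\eta)^{-1}} - \frac{\Delta(x,a)^2}{2}\Big),
\]
which lends itself to an incremental saddle-point scheme alternating (projected) descent steps on $\eta$ and ascent steps on $\lambda$, the type of primal-dual approach alluded to in the abstract.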