Designing efficient general-purpose contextual bandit algorithms that work with large -- or even continuous -- action spaces would facilitate their application to important scenarios such as information retrieval, recommendation systems, and continuous control. While obtaining standard regret guarantees can be hopeless, alternative regret notions have been proposed to tackle the large action setting. We propose a smooth regret notion for contextual bandits, which dominates previously proposed alternatives. We design a statistically and computationally efficient algorithm -- for the proposed smooth regret -- that works with general function approximation under standard supervised oracles. We also present an adaptive algorithm that automatically adapts to any smoothness level. Our algorithms can be used to recover previous minimax/Pareto-optimal guarantees under the standard regret, e.g., in bandit problems with multiple best arms and in Lipschitz/H{\"o}lder bandits. We conduct large-scale empirical evaluations demonstrating the efficacy of our proposed algorithms.
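To make the central object concrete, one plausible formalization of the smooth regret is the following, written in the spirit of smoothed benchmarks; the notation here ($f^*$ for the true mean-reward function, $\mu$ for a reference measure on the action space $\mathcal{A}$, $h$ for the smoothness level, and $\mathcal{Q}_h$ for the smoothed comparator class) is ours, and the paper's precise definition may differ. The learner competes against the best policy whose per-context action distribution has density at most $1/h$ with respect to $\mu$:
\[
\mathrm{Reg}_h(T) \;=\; \mathbb{E}\Bigg[\sum_{t=1}^{T}\bigg(\sup_{Q \in \mathcal{Q}_h} \mathbb{E}_{a \sim Q}\big[f^*(x_t, a)\big] \;-\; f^*(x_t, a_t)\bigg)\Bigg],
\qquad
\mathcal{Q}_h \;:=\; \Big\{Q \;:\; \tfrac{dQ}{d\mu} \le \tfrac{1}{h}\Big\}.
\]
Under this reading, taking $h \to 0$ drives the benchmark toward the standard regret against the single best action, while larger $h$ only requires matching density-bounded (smoothed) policies -- which is what makes large and continuous action spaces tractable, and suggests how instantiating $h$ appropriately can recover, e.g., Lipschitz/H{\"o}lder guarantees.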