A central problem in sequential decision making is to develop algorithms that are practical and computationally efficient, yet support the use of flexible, general-purpose models. Focusing on the contextual bandit problem, recent progress provides provably efficient algorithms with strong empirical performance when the number of possible alternatives ("actions") is small, but guarantees for decision making in large, continuous action spaces have remained elusive, leading to a significant gap between theory and practice. We present the first efficient, general-purpose algorithm for contextual bandits with continuous, linearly structured action spaces. Our algorithm makes use of computational oracles for (i) supervised learning, and (ii) optimization over the action space, and achieves sample complexity, runtime, and memory independent of the size of the action space. In addition, it is simple and practical. We perform a large-scale empirical evaluation, and show that our approach typically enjoys superior performance and efficiency compared to standard baselines.
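To make the two computational oracles mentioned above concrete, here is a minimal, hedged sketch of what their interfaces might look like; the class and method names (`RegressionOracle`, `ActionOptimizationOracle`, `argmax`) are illustrative assumptions, not the paper's actual API.

```python
from abc import ABC, abstractmethod
import numpy as np

# Hypothetical oracle interfaces; names and signatures are illustrative only.

class RegressionOracle(ABC):
    """Supervised-learning oracle: fits a reward predictor from logged interaction data."""

    @abstractmethod
    def fit(self, contexts: np.ndarray, actions: np.ndarray, rewards: np.ndarray) -> None:
        """Update the predictor on observed (context, action, reward) triples."""
        ...

    @abstractmethod
    def predict(self, context: np.ndarray, action: np.ndarray) -> float:
        """Return the predicted expected reward for a (context, action) pair."""
        ...


class ActionOptimizationOracle(ABC):
    """Optimization oracle over the (continuous, linearly structured) action set."""

    @abstractmethod
    def argmax(self, theta: np.ndarray) -> np.ndarray:
        """Return an action maximizing the linear objective <theta, a> over the action set."""
        ...
```

Because the algorithm only interacts with the action space through such an optimization oracle, its sample complexity, runtime, and memory can remain independent of the action space's size.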