We study the problem of model selection for contextual bandits, in which the algorithm must balance the bias-variance trade-off in model estimation while simultaneously managing the exploration-exploitation trade-off. In this paper, we propose the first reduction of model selection in contextual bandits to offline model-selection oracles, allowing for flexible, general-purpose algorithms whose computational requirements are no worse than those of model selection for regression. Our main result is a new model selection guarantee for stochastic contextual bandits. When one of the classes in our set is realizable, our algorithm attains, up to a logarithmic dependence on the number of classes, the optimal realizability-based regret bound for that class under one of two conditions: the time horizon is large enough, or an assumption that aids in detecting misspecification holds. Our algorithm therefore adapts to the complexity of this unknown class. Even when the realizable class is known, we prove improved regret guarantees in early rounds by relying on simpler model classes for those rounds, further establishing the importance of model selection in contextual bandits.
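To make the reduction concrete, below is a minimal sketch of the general recipe: an epsilon-greedy contextual bandit that periodically hands its logged data to an offline model-selection oracle, which picks among nested regression classes. Everything here is an illustrative assumption, not the paper's actual algorithm or its guarantees: the holdout-validation oracle, the polynomial feature classes, the doubling epoch schedule, and all function names are hypothetical choices made only to show the shape of the reduction.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, a, degree, n_actions):
    """Polynomial context features, one block per action (plus a bias)."""
    phi = np.concatenate([x ** d for d in range(1, degree + 1)])
    out = np.zeros(n_actions * len(phi) + 1)
    out[a * len(phi):(a + 1) * len(phi)] = phi
    out[-1] = 1.0
    return out

def fit_class(data, degree, n_actions):
    """Least-squares reward regression within one model class."""
    X = np.stack([features(x, a, degree, n_actions) for x, a, _ in data])
    y = np.array([r for _, _, r in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def offline_select(data, degrees, n_actions):
    """Offline model-selection oracle: holdout validation across classes."""
    split = len(data) // 2
    train, val = data[:split], data[split:]
    best, best_err = None, np.inf
    for d in degrees:
        w = fit_class(train, d, n_actions)
        err = np.mean([(features(x, a, d, n_actions) @ w - r) ** 2
                       for x, a, r in val])
        if err < best_err:
            best, best_err = (d, fit_class(data, d, n_actions)), err
    return best  # (selected degree, weights refit on all logged data)

n_actions, T, degrees, eps = 3, 2000, [1, 2, 3], 0.1
model, data, total_reward = None, [], 0.0

for t in range(1, T + 1):
    x = rng.normal(size=2)                          # observe context
    if model is None or rng.random() < eps:
        a = int(rng.integers(n_actions))            # uniform exploration
    else:
        d, w = model                                # exploit selected class
        a = int(np.argmax([features(x, b, d, n_actions) @ w
                           for b in range(n_actions)]))
    r = float((x @ x) * (a == 1)) + 0.1 * rng.normal()  # degree-2 realizable
    data.append((x, a, r))
    total_reward += r
    if t >= 8 and (t & (t - 1)) == 0:               # doubling epoch schedule
        model = offline_select(data, degrees, n_actions)

print("selected degree:", model[0], "average reward:", total_reward / T)
```

On this synthetic problem the oracle typically settles on the realizable degree-2 class once enough data accumulates, loosely mirroring the adaptivity discussed above: the learner pays only for the complexity of the class it ends up needing, while simpler classes suffice in early rounds.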