We consider the problem of model selection for general stochastic contextual bandits under the realizability assumption. We propose a successive-refinement-based algorithm, Adaptive Contextual Bandit ({\ttfamily ACB}), that works in phases and successively eliminates model classes that are too simple to fit the given instance. We prove that this algorithm is adaptive: its regret rate order-wise matches that of {\ttfamily FALCON}, the state-of-the-art contextual bandit algorithm of Simchi-Levi and Xu '20, which requires knowledge of the true model class. The price of not knowing the correct model class is only an additive term contributing to the second-order term in the regret bound. This cost has the intuitive property that it shrinks as the model class becomes easier to identify, and grows as it becomes harder. We then show that a much simpler explore-then-commit (ETC) style algorithm also obtains a regret rate matching that of {\ttfamily FALCON}, despite not knowing the true model class. However, as expected, the cost of model selection is higher for ETC than for {\ttfamily ACB}. Furthermore, {\ttfamily ACB} applied to the linear bandit setting with unknown sparsity order-wise recovers the model selection guarantees previously established by algorithms tailored to the linear setting.
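Schematically, the main guarantee can be read as follows (the notation $\Delta_T$ is our placeholder for the model-selection cost, not the paper's):
\[
\mathrm{Regret}_T(\mathtt{ACB}) \;\lesssim\; \underbrace{\mathrm{Regret}_T(\mathtt{FALCON})}_{\text{true model class known}} \;+\; \Delta_T,
\]
where $\Delta_T$ contributes only to the second-order term, shrinks as the true model class becomes easier to identify, and is larger for the ETC-style algorithm than for {\ttfamily ACB}.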