We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta algorithm plays each base learner according to a schedule that keeps the base learners' candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful misspecification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL algorithm for adversarial environments, but with the additional benefit of achieving high-probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best-of-both-worlds (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.
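To make the abstract's scheduling idea concrete, the following is a minimal, hedged sketch of a regret-bound balancing loop with an elimination test. It is not the paper's exact algorithm or statistics: the `BaseLearner` stub, the candidate-bound callables, the confidence term, and the constants are hypothetical placeholders chosen only to illustrate "play the learner whose accumulated candidate bound is smallest, and drop a learner once its bound is detected to be violated."

```python
import math
import random

# Illustrative sketch only (not the paper's algorithm): regret-bound balancing
# with a misspecification/elimination test. All names and constants here are
# hypothetical placeholders.

class BaseLearner:
    """Stub base learner: proposes an action index and observes a reward."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self):
        return random.randrange(self.n_actions)

    def update(self, action, reward):
        pass  # a real base learner would update its internal state here

def run_balancing(learners, candidate_bounds, reward_fn, horizon, delta=0.01):
    """Play the base learner whose accumulated candidate regret bound is
    currently smallest (keeping the bounds balanced), and eliminate any
    learner whose candidate bound is detected to be violated."""
    m = len(learners)
    active = set(range(m))
    plays = [0] * m          # n_i: number of rounds learner i was played
    cum_reward = [0.0] * m   # U_i: total reward collected by learner i

    for _ in range(horizon):
        # Balancing schedule: the smallest accumulated candidate bound goes next.
        i = min(active, key=lambda j: candidate_bounds[j](plays[j]))
        a = learners[i].act()
        r = reward_fn(a)
        learners[i].update(a, r)
        plays[i] += 1
        cum_reward[i] += r

        # Misspecification test (standard balancing-and-elimination template):
        # eliminate learner i if its average reward, even after adding its
        # candidate regret bound and a confidence term, falls below the best
        # lower-confidence average reward among active learners.
        conf = lambda j: math.sqrt(2 * math.log(m * horizon / delta) / max(plays[j], 1))
        if all(plays[j] > 0 for j in active) and len(active) > 1:
            best_lcb = max(cum_reward[j] / plays[j] - conf(j) for j in active)
            ucb_i = (cum_reward[i] + candidate_bounds[i](plays[i])) / plays[i] + conf(i)
            if ucb_i < best_lcb:
                active.discard(i)  # candidate bound detected as violated

    return cum_reward, plays, active
```

Under this schedule, well-specified base learners are played comparably often (their candidate bounds stay balanced), while a learner whose candidate bound fails is eventually caught by the test and removed; the paper's contribution is designing such tests so that the resulting guarantees hold with high probability in adversarial environments and improve under gap assumptions in stochastic ones.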