Tractable contextual bandit algorithms often rely on the realizability assumption, i.e., that the true expected reward model belongs to a known class, such as the class of linear functions. In this work, we present a tractable bandit algorithm that is not sensitive to the realizability assumption and that computationally reduces to solving a constrained regression problem in every epoch. When realizability fails, our algorithm retains the same regret guarantees that realizability-based algorithms achieve under realizability, up to an additive term that accounts for the misspecification error. This extra term is proportional to T times a function of the mean squared error between the best model in the class and the true model, where T is the total number of time steps. Our work sheds light on the bias-variance trade-off for tractable contextual bandits. This trade-off is not captured by algorithms that assume realizability, since under that assumption there exists an estimator in the class that attains zero bias.
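The per-epoch reduction to constrained regression can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the paper's algorithm: the environment (`SyntheticEnv`), the L2-ball constraint and its projection surrogate in `fit_constrained_regression`, and the epsilon-greedy exploration rule are all hypothetical stand-ins; the paper's actual constraint set and action-selection scheme differ. The sketch only shows the structural idea of refitting a constrained reward model once per epoch and acting on it between refits.

```python
import numpy as np

# A minimal sketch, assuming a toy linear setting: the algorithm proceeds
# in epochs, refits a reward model by constrained least squares at the
# start of each epoch, and acts on that model within the epoch. The
# environment, the constraint, and the exploration rule are illustrative
# stand-ins, not the paper's method.

class SyntheticEnv:
    """Hypothetical toy bandit environment with a linear true reward model."""

    def __init__(self, dim, num_actions, noise=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dim, self.num_actions, self.noise = dim, num_actions, noise
        self.theta_star = self.rng.normal(size=dim) / np.sqrt(dim)

    def observe_context(self):
        return self.rng.normal(size=self.dim)

    def featurize(self, context, action):
        # Action-dependent features: a cyclic shift of the context vector.
        return np.roll(context, action)

    def pull(self, context, action):
        mean = self.featurize(context, action) @ self.theta_star
        return mean + self.noise * self.rng.normal()


def fit_constrained_regression(X, y, radius=1.0):
    """Least squares with theta constrained to an L2 ball of given radius.

    Projecting the unconstrained solution onto the ball is a crude
    surrogate for a proper constrained solve; it keeps the sketch short.
    """
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    norm = np.linalg.norm(theta)
    return theta * (radius / norm) if norm > radius else theta


def run(env, T, epoch_len=100, radius=1.0, eps=0.1, seed=1):
    """Epoch loop: refit once per epoch, act mostly greedily in between."""
    rng = np.random.default_rng(seed)
    X_hist, y_hist = [], []
    theta = np.zeros(env.dim)
    total_reward = 0.0
    for t in range(T):
        # At each epoch boundary, solve the constrained regression problem
        # over all data gathered so far.
        if t % epoch_len == 0 and X_hist:
            theta = fit_constrained_regression(
                np.asarray(X_hist), np.asarray(y_hist), radius
            )
        context = env.observe_context()
        feats = [env.featurize(context, a) for a in range(env.num_actions)]
        if rng.random() < eps:  # exploration stand-in
            action = int(rng.integers(env.num_actions))
        else:
            action = int(np.argmax([f @ theta for f in feats]))
        reward = env.pull(context, action)
        X_hist.append(feats[action])
        y_hist.append(reward)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    env = SyntheticEnv(dim=8, num_actions=4)
    print("total reward over T=2000 steps:", run(env, T=2000))
```

Note the design point the abstract emphasizes: because the model class here is linear, the toy environment above is well specified, so the best in-class model has zero bias; the paper's analysis concerns what happens to regret when the true reward model lies outside the class and that bias term is nonzero.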