Computationally efficient contextual bandits are often based on estimating a predictive model of rewards given contexts and arms using past data. However, when the reward model is not well-specified, the bandit algorithm may incur unexpected regret, so recent work has focused on algorithms that are robust to misspecification. We propose a simple family of contextual bandit algorithms that adapt to misspecification error by reverting to a good safe policy when there is evidence that misspecification is causing a regret increase. Our algorithm requires only an offline regression oracle to ensure regret guarantees that degrade gracefully in terms of a measure of the average misspecification level. Compared to prior work, we attain similar regret guarantees, but we do not rely on a master algorithm and do not require more robust oracles such as online or constrained regression oracles (e.g., Foster et al. (2020a); Krishnamurthy et al. (2020)). This allows us to design algorithms for more general function approximation classes.
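As a rough illustration of the mechanism described above, the following is a minimal sketch, not the paper's actual algorithm: it assumes a linear reward model, a hypothetical ridge-regression offline oracle, a fixed arm standing in for the safe policy, and an ad hoc threshold on the gap between predicted and realised rewards as "evidence" of misspecification.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, a, n_arms):
    """Block one-hot encoding: context vector placed in the slot of arm a."""
    phi = np.zeros(len(x) * n_arms)
    phi[a * len(x):(a + 1) * len(x)] = x
    return phi

def offline_oracle(data, dim):
    """Hypothetical offline regression oracle: ridge least squares on logged data."""
    if not data:
        return np.zeros(dim)
    X = np.array([phi for phi, _ in data])
    y = np.array([r for _, r in data])
    return np.linalg.solve(X.T @ X + 1e-3 * np.eye(dim), X.T @ y)

def run(T=2000, d=5, n_arms=3):
    dim = d * n_arms
    theta_true = rng.normal(size=dim)   # synthetic environment, well-specified here
    safe_arm = 0                        # stand-in for a known good safe policy
    data, theta_hat = [], np.zeros(dim)
    reverted = False
    cum_pred, cum_obs = 0.0, 0.0        # predicted vs. realised reward of chosen arms

    for t in range(1, T + 1):
        x = rng.normal(size=d)
        if reverted:
            a = safe_arm                # revert to the safe policy
        else:
            # Greedy w.r.t. the offline-oracle estimate, with light exploration.
            preds = [features(x, b, n_arms) @ theta_hat for b in range(n_arms)]
            a = int(rng.integers(n_arms)) if rng.random() < 0.05 else int(np.argmax(preds))
        phi = features(x, a, n_arms)
        r = phi @ theta_true + 0.1 * rng.normal()   # observed reward
        data.append((phi, r))
        cum_pred += phi @ theta_hat
        cum_obs += r
        # Evidence of misspecification (heuristic): model predictions drift away
        # from realised rewards by more than a conservative deviation bound.
        if not reverted and abs(cum_pred - cum_obs) > 3.0 * np.sqrt(t):
            reverted = True
        if t % 200 == 0:
            theta_hat = offline_oracle(data, dim)   # periodic offline refit
    return reverted

if __name__ == "__main__":
    print("reverted to safe policy:", run())
```

In this well-specified synthetic setting the monitor rarely triggers; replacing the linear environment with a nonlinear one makes the prediction gap grow and the sketch falls back to the safe arm, which is the qualitative behaviour the abstract describes.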