Model misspecification is a major consideration in applications of statistical methods and machine learning, yet it is often neglected in contextual bandits. This paper studies a common form of misspecification: inter-arm heterogeneity that is not captured by the context. To address this issue, we assume that the heterogeneity arises from arm-specific random variables, which can be learned. We call this setting a robust contextual bandit. The arm-specific variables explain the unknown inter-arm heterogeneity, and we incorporate them in a robust contextual estimator of the mean reward and its uncertainty. We develop two efficient bandit algorithms for this setting: a UCB algorithm called RoLinUCB and a posterior-sampling algorithm called RoLinTS. We analyze both algorithms and bound their $n$-round Bayes regret. Our experiments show that RoLinTS is comparable in statistical efficiency to classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.
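To make the modeling assumption concrete, below is a minimal sketch of the posterior-sampling idea in this setting: linear Thompson sampling over a context augmented with a one-hot arm indicator, so that per-arm random effects are learned jointly with the shared parameter. This is not the paper's RoLinTS algorithm, which the abstract does not specify; the constants, the simulated environment, and the augmented-feature construction are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: posterior sampling for a contextual bandit whose
# reward is a shared linear term plus an arm-specific random effect.
# Generic linear Thompson sampling over an augmented feature vector
# (context concatenated with a one-hot arm indicator); all constants
# (K, d, noise_sd, lam) are illustrative assumptions.

rng = np.random.default_rng(0)

K, d, n = 5, 3, 2000                   # arms, context dimension, rounds
theta = rng.normal(size=d)             # shared parameter (unknown to learner)
bias = rng.normal(scale=0.5, size=K)   # arm-specific random effects
noise_sd, lam = 0.5, 1.0               # reward noise, prior precision

p = d + K
G = lam * np.eye(p)                    # posterior precision matrix
B = np.zeros(p)                        # precision-weighted reward sums

def phi(x, a):
    """Augmented feature: context plus a one-hot arm indicator."""
    f = np.zeros(p)
    f[:d] = x
    f[d + a] = 1.0
    return f

regret = 0.0
for _ in range(n):
    x = rng.normal(size=d)             # observe a context
    cov = np.linalg.inv(G)
    sample = rng.multivariate_normal(cov @ B, cov)  # posterior sample
    a = int(np.argmax([phi(x, i) @ sample for i in range(K)]))
    r = x @ theta + bias[a] + rng.normal(scale=noise_sd)
    f = phi(x, a)
    G += np.outer(f, f) / noise_sd**2  # Bayesian linear-regression update
    B += f * r / noise_sd**2
    regret += max(x @ theta + bias[i] for i in range(K)) - (x @ theta + bias[a])

print(f"average per-round regret: {regret / n:.3f}")
```

Augmenting the features this way keeps the posterior jointly Gaussian, so sampling and updates stay in closed form; the paper's algorithms presumably exploit more of the two-level structure to achieve the computational efficiency claimed in the abstract.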