We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class only up to a bounded misspecification level $\zeta>0$. We propose an algorithm based on a novel data selection scheme, which selects only the contextual vectors with large uncertainty for online regression. We show that, when the misspecification level $\zeta$ is dominated by $\tilde O(\Delta/\sqrt{d})$, where $\Delta$ is the minimal sub-optimality gap and $d$ is the dimension of the contextual vectors, our algorithm enjoys the same gap-dependent regret bound $\tilde O(d^2/\Delta)$ as in the well-specified setting, up to logarithmic factors. In addition, we show that the existing algorithm SupLinUCB (Chu et al., 2011) can also achieve a gap-dependent constant regret bound without knowledge of the sub-optimality gap $\Delta$. Together with a lower bound adapted from Lattimore et al. (2020), our results suggest an interplay between the misspecification level and the sub-optimality gap: (1) the linear contextual bandit model is efficiently learnable when $\zeta \leq \tilde O(\Delta/\sqrt{d})$; and (2) it is not efficiently learnable when $\zeta \geq \tilde\Omega(\Delta/\sqrt{d})$. Experiments on both synthetic and real-world datasets corroborate our theoretical results.
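To make the uncertainty-based data selection scheme concrete, the following is a minimal sketch, not the paper's exact algorithm: it pairs a ridge-regression estimator with an optimistic (UCB-style) arm choice and adds a pulled context $x$ to the regression data only when its uncertainty width $\|x\|_{\Sigma^{-1}}$ exceeds a threshold. The threshold `gamma`, bonus scale `beta`, candidate-arm generation, and the specific misspecified reward model are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
lam, beta, gamma = 1.0, 1.0, 0.05   # ridge parameter, bonus scale, selection threshold (assumed values)
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

Sigma = lam * np.eye(d)   # Gram matrix built from *selected* contexts only
b = np.zeros(d)           # response-weighted sum of selected contexts

for t in range(T):
    X = rng.normal(size=(20, d))                       # candidate contextual vectors this round
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    Sigma_inv = np.linalg.inv(Sigma)
    theta_hat = Sigma_inv @ b                          # ridge estimate from selected data
    widths = np.sqrt(np.einsum('ij,jk,ik->i', X, Sigma_inv, X))  # ||x||_{Sigma^{-1}} per arm
    a = np.argmax(X @ theta_hat + beta * widths)       # optimistic arm choice
    x = X[a]
    # Linear reward plus a bounded perturbation, standing in for misspecification level zeta
    r = x @ theta_star + 0.01 * np.sin(10 * x.sum()) + 0.1 * rng.normal()
    if widths[a] >= gamma:                             # keep only high-uncertainty contexts
        Sigma += np.outer(x, x)
        b += r * x
```

Because $\Sigma$ grows only when a high-uncertainty context is kept, the number of selected points stays bounded, and the regression estimate is not accumulated over many low-information samples whose errors are dominated by the misspecification.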