A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, but typically require a well-specified model, and can fail when this assumption does not hold. Can we design algorithms that are efficient and flexible, yet degrade gracefully in the face of model misspecification? We introduce a new family of oracle-efficient algorithms for $\varepsilon$-misspecified contextual bandits that adapt to unknown model misspecification -- both for finite and infinite action settings. Given access to an online oracle for square loss regression, our algorithm attains optimal regret and -- in particular -- optimal dependence on the misspecification level, with no prior knowledge. Specializing to linear contextual bandits with infinite actions in $d$ dimensions, we obtain the first algorithm that achieves the optimal $O(d\sqrt{T} + \varepsilon\sqrt{d}T)$ regret bound for unknown misspecification level $\varepsilon$. On a conceptual level, our results are enabled by a new optimization-based perspective on the regression oracle reduction framework of Foster and Rakhlin, which we anticipate will find broader use.
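To make the regression-oracle reduction mentioned above concrete, here is a minimal sketch of the inverse-gap-weighting rule from Foster and Rakhlin's framework, which converts a regression oracle's predicted rewards into an exploration distribution over a finite action set. The function name, the example predictions, and the choice of learning rate `gamma` are illustrative, not taken from the paper.

```python
import numpy as np

def inverse_gap_weighting(preds, gamma):
    """Map predicted rewards for K actions to a sampling distribution.

    Non-greedy actions get probability 1 / (K + gamma * gap), where
    gap is the predicted-reward shortfall relative to the greedy action;
    the greedy action receives the remaining mass. Larger gamma
    concentrates play on the greedy action (less exploration).
    """
    preds = np.asarray(preds, dtype=float)
    K = len(preds)
    best = int(np.argmax(preds))          # greedy action under the oracle
    gaps = preds[best] - preds            # nonnegative prediction gaps
    p = np.zeros(K)
    nonbest = np.arange(K) != best
    p[nonbest] = 1.0 / (K + gamma * gaps[nonbest])
    p[best] = 1.0 - p[nonbest].sum()      # leftover mass to greedy action
    return p

# Example: three actions, moderate exploration rate.
probs = inverse_gap_weighting([0.9, 0.5, 0.2], gamma=10.0)
```

In the example, the two non-greedy actions receive probabilities 1/7 and 1/10 respectively, and the greedy action absorbs the rest; as `gamma` grows, play concentrates on the oracle's favored action, which is the mechanism the abstract's regret guarantees tune against the (unknown) misspecification level.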