Linear contextual bandits are an important class of sequential decision-making problems with a wide range of applications in recommender systems, online advertising, healthcare, and many other machine-learning tasks. Despite extensive prior research, tight regret bounds for linear contextual bandits with infinite action sets have remained open. In this paper, we address this open problem by considering the linear contextual bandit with (changing) infinite action sets. We prove a regret upper bound of order $O(\sqrt{d^2 T \log T}) \times \text{poly}(\log\log T)$, where $d$ is the domain dimension and $T$ is the time horizon. Our upper bound matches the lower bound of $\Omega(\sqrt{d^2 T \log T})$ in [Li et al., 2019] up to iterated-logarithmic terms.