We prove an instance-independent (poly) logarithmic regret bound for stochastic contextual bandits with linear payoffs. Previously, \cite{chu2011contextual} showed a lower bound of $\Omega(\sqrt{T})$ for the contextual linear bandit problem with arbitrary (adversarially chosen) contexts. In this paper, we show that stochastic contexts indeed help to reduce the regret from $\sqrt{T}$ to $\polylog(T)$. We propose Low Regret Stochastic Contextual Bandits (\texttt{LR-SCB}), which takes advantage of the stochastic contexts and performs parameter estimation (in $\ell_2$ norm) and regret minimization simultaneously. \texttt{LR-SCB} works in epochs, where the parameter estimate from the previous epoch is used to reduce the regret of the current epoch. The (poly) logarithmic regret of \texttt{LR-SCB} stems from two crucial facts: (a) the application of a norm-adaptive algorithm to exploit the parameter estimation and (b) an analysis of the shifted linear contextual bandit algorithm, showing that shifting results in increasing regret. We also show experimentally that stochastic contexts indeed incur a regret that scales as $\polylog(T)$.
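The epoch-based structure described above can be illustrated by the following minimal sketch. It is an illustration only, not the paper's algorithm: the doubling epoch schedule, the ridge-regression estimator, and the purely greedy action choice within an epoch are simplifying assumptions and do not reproduce the norm-adaptive subroutine of \texttt{LR-SCB}.
\begin{verbatim}
import numpy as np

# Minimal sketch of an epoch-based stochastic contextual linear bandit loop.
# All details (epoch lengths, ridge-regression estimator, greedy action
# choice inside an epoch) are illustrative assumptions, not LR-SCB itself.
def run_epochs(T, d, K, theta_star, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta_hat = np.zeros(d)            # estimate carried over from the previous epoch
    contexts_seen, rewards_seen = [], []
    total_regret, t, epoch_len = 0.0, 0, 1

    while t < T:
        epoch_len = min(2 * epoch_len, T - t)     # doubling epoch schedule (assumption)
        for _ in range(epoch_len):
            X = rng.normal(size=(K, d))           # stochastic contexts, one per arm
            a = int(np.argmax(X @ theta_hat))     # exploit the previous epoch's estimate
            r = X[a] @ theta_star + noise_std * rng.normal()
            total_regret += np.max(X @ theta_star) - X[a] @ theta_star
            contexts_seen.append(X[a]); rewards_seen.append(r)
            t += 1
        # re-estimate the parameter (in ell_2 norm) via ridge regression
        Xs, ys = np.array(contexts_seen), np.array(rewards_seen)
        theta_hat = np.linalg.solve(Xs.T @ Xs + np.eye(d), Xs.T @ ys)
    return total_regret

# Example usage (hypothetical parameters):
# run_epochs(T=10000, d=5, K=10, theta_star=np.ones(5) / np.sqrt(5))
\end{verbatim}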