This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors and the expected rewards are linear in both the arm parameters and the contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have cumulative regret that is suboptimal in both the dimension $d$ and the time horizon $T$, due to its over-exploration. A truncated version of LinUCB, termed "Tr-LinUCB", is proposed: it follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. Tr-LinUCB is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, showing the rate optimality of Tr-LinUCB in both $d$ and $T$ in a low-dimensional regime. Further, if $S = d\log^{\kappa}(T)$ for some $\kappa > 1$, the loss compared to the optimal rate is only a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting when choosing the truncation time makes Tr-LinUCB appealing in practice.
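As a minimal sketch of the truncation idea (not the paper's implementation), the following Python code runs LinUCB-style optimistic play for the first $S = Cd\log(T)$ rounds and then switches to greedy exploitation. The specifics are assumptions: a disjoint-parameter linear model with per-arm ridge estimates, a UCB width `alpha`, a regularizer `lam`, and the choice to keep updating the estimates after time $S$; the names `tr_linucb` and `reward_fn` are hypothetical.

```python
import numpy as np

def tr_linucb(contexts, reward_fn, K, d, T, alpha=1.0, lam=1.0, C=1.0):
    """Hypothetical sketch of Tr-LinUCB: LinUCB up to the truncation
    time S, pure exploitation (greedy on the ridge estimates) afterwards."""
    S = int(C * d * np.log(T))                 # truncation time from the abstract
    A = [lam * np.eye(d) for _ in range(K)]    # per-arm ridge Gram matrices
    b = [np.zeros(d) for _ in range(K)]        # per-arm response vectors
    total_reward = 0.0
    for t in range(T):
        x = contexts[t]                        # i.i.d. d-dimensional context
        theta_hat = [np.linalg.solve(A[a], b[a]) for a in range(K)]
        if t < S:
            # LinUCB phase: optimistic index = estimate + confidence width
            scores = [theta_hat[a] @ x
                      + alpha * np.sqrt(x @ np.linalg.solve(A[a], x))
                      for a in range(K)]
        else:
            # Exploitation phase: greedy on the estimated mean reward
            scores = [theta_hat[a] @ x for a in range(K)]
        a_t = int(np.argmax(scores))
        r = reward_fn(a_t, x)                  # observe reward for chosen arm
        A[a_t] += np.outer(x, x)               # update chosen arm's statistics
        b[a_t] += r * x
        total_reward += r
    return total_reward
```

The sketch reflects the abstract's design point: overshooting the truncation time (e.g., $S = d\log^{\kappa}(T)$ with $\kappa > 1$) costs only a $d$-free $\log\log(T)$ factor, so the constant `C` need not be tuned precisely.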