We investigate two perturbation approaches to overcome the conservatism that optimism-based algorithms chronically suffer from in practice. The first approach replaces optimism with simple randomization when using confidence sets. The second adds random perturbations to the current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature vector and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), via these two perturbation approaches. We highlight the trade-off between statistical optimality and computational efficiency: the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d^{7/8} B_T^{1/4} T^{3/4})$, while the latter is oracle-efficient at the cost of an extra logarithmic factor in the number of arms relative to the minimax-optimal dynamic regret. In a simulation study, both algorithms show outstanding performance in tackling the conservatism issue that Discounted LinUCB struggles with.
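For concreteness, below is a minimal sketch in Python of one round of the second perturbation approach (the D-LinTS idea): a discounted ridge-regression estimate is perturbed before maximizing the expected reward. The function name `dlints_round`, the discount factor `gamma`, the regularizer `lam`, and the perturbation scale `nu` are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def dlints_round(t, arms, history, gamma=0.95, lam=1.0, nu=1.0, rng=None):
    """One illustrative round of a discounted, perturbation-based linear bandit.

    arms: (K, d) array of arm feature vectors.
    history: list of (x, r) pairs observed in rounds 0..t-1.
    """
    rng = rng or np.random.default_rng()
    d = arms.shape[1]

    # Discounted ridge regression: observations from round s get weight gamma**(t-1-s),
    # so older data is down-weighted to track the time-varying parameter.
    V = lam * np.eye(d)
    b = np.zeros(d)
    for s, (x, r) in enumerate(history):
        w = gamma ** (t - 1 - s)
        V += w * np.outer(x, x)
        b += w * r * x
    theta_hat = np.linalg.solve(V, b)

    # Perturb the estimate before maximizing the expected reward; this random
    # perturbation plays the role of the optimistic confidence-set bonus.
    L = np.linalg.cholesky(V)                      # V = L @ L.T
    eta = rng.standard_normal(d)
    theta_tilde = theta_hat + nu * np.linalg.solve(L.T, eta)  # covariance nu^2 V^{-1}

    # Greedy step on the perturbed estimate.
    return int(np.argmax(arms @ theta_tilde))
```

The key design point illustrated here is that the argmax is taken over plain linear scores of a single perturbed parameter, so the per-round computation reduces to one call to a linear-optimization oracle over the arm set, which is what makes this approach oracle-efficient.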