We propose a novel algorithm for linear contextual bandits with an $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contributions either from the contexts of all arms or from the selected contexts only. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. Numerical experiments support the theoretical guarantees and show that our proposed method outperforms existing linear bandit algorithms.
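To make the randomized-estimator idea concrete, the following is a minimal sketch, not the paper's exact construction: a ridge-regression estimator in which a per-round coin flip decides whether the contexts of all arms (with imputed rewards) or only the selected context enter the update. The class name `RandomizedRidgeEstimator`, the probability `p_all`, and the imputation rule are illustrative assumptions rather than the proposed method.

```python
import numpy as np

class RandomizedRidgeEstimator:
    """Hypothetical sketch of a ridge estimator with exploration embedded
    through explicit randomization: each round, a coin flip decides whether
    the update uses the contexts of all arms or only the selected context."""

    def __init__(self, d, lam=1.0, p_all=0.5, rng=None):
        self.A = lam * np.eye(d)      # regularized Gram matrix
        self.b = np.zeros(d)          # accumulated context-reward products
        self.p_all = p_all            # probability of the all-arm update
        self.rng = rng or np.random.default_rng()

    @property
    def theta_hat(self):
        # Current ridge estimate of the unknown parameter
        return np.linalg.solve(self.A, self.b)

    def update(self, contexts, chosen, reward):
        """contexts: (K, d) array of arm contexts; chosen: index of the pulled
        arm; reward: observed reward of the pulled arm."""
        if self.rng.random() < self.p_all:
            # Randomized branch: every arm's context contributes; rewards of
            # unpulled arms are imputed by the current estimate (assumption).
            preds = contexts @ self.theta_hat
            preds[chosen] = reward
            self.A += contexts.T @ contexts
            self.b += contexts.T @ preds
        else:
            # Standard branch: only the selected arm's context contributes.
            x = contexts[chosen]
            self.A += np.outer(x, x)
            self.b += reward * x
```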