We propose a linear contextual bandit algorithm with an $O(\sqrt{dT\log T})$ regret bound, where $d$ is the dimension of the contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the outcome of the randomization, our estimator takes contributions either from the contexts of all arms or from the selected contexts only. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into \textit{additive} dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. Numerical experiments support the theoretical guarantees and show that our proposed method outperforms existing linear bandit algorithms.
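To make the randomized-estimator idea concrete, below is a minimal Python sketch of this pattern, not the paper's exact algorithm: at each round a coin flip decides whether the arm is chosen greedily (in which case the Gram matrix absorbs the contexts of all arms) or uniformly at random (in which case only the selected context contributes). The probability `p`, the ridge parameter `lam`, the Gaussian context and reward model, and all variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of a linear contextual bandit whose estimator
# embeds exploration through an explicit coin flip. This is NOT the
# paper's exact estimator; constants and the data model are assumed.

rng = np.random.default_rng(0)
d, K, T = 5, 10, 2000        # context dimension, number of arms, horizon
lam, p = 1.0, 0.5            # ridge parameter, exploitation probability (assumed)
theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown true parameter

V = lam * np.eye(d)          # regularized Gram matrix
b = np.zeros(d)              # running sum of reward-weighted contexts

for t in range(T):
    X = rng.normal(size=(K, d)) / np.sqrt(d)   # contexts of the K arms at round t
    theta_hat = np.linalg.solve(V, b)          # current ridge-style estimate
    if rng.random() < p:
        a = int(np.argmax(X @ theta_hat))      # exploit the current estimate
        V += X.T @ X / K                       # contexts of ALL arms contribute
    else:
        a = int(rng.integers(K))               # explicit randomization: uniform arm
        V += np.outer(X[a], X[a])              # only the selected context contributes
    r = float(X[a] @ theta_star) + 0.1 * rng.normal()   # noisy linear reward
    b += r * X[a]
```

In this caricature, the randomization serves double duty: it forces occasional uniform exploration, and it determines whether the Gram matrix update draws on all arms' contexts or the selected one, mirroring the two contribution modes described in the abstract.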