We propose a novel contextual bandit algorithm for generalized linear rewards with an $\tilde{O}(\sqrt{\kappa^{-1} \phi T})$ regret over $T$ rounds, where $\phi$ is the minimum eigenvalue of the covariance of contexts and $\kappa$ is a lower bound on the variance of rewards. In several practical cases where $\phi=O(d)$, our result is the first regret bound for generalized linear model (GLM) bandits of order $\sqrt{d}$ that does not rely on the approach of Auer [2002]. We achieve this bound with a novel estimator called the double doubly-robust (DDR) estimator, a subclass of doubly-robust (DR) estimators with a tighter error bound. The approach of Auer [2002] achieves independence by discarding the observed rewards, whereas our algorithm achieves independence while making use of all contexts via the DDR estimator. We also provide an $O(\kappa^{-1} \phi \log (NT) \log T)$ regret bound for $N$ arms under a probabilistic margin condition. Regret bounds under the margin condition are given by Bastani and Bayati [2020] and Bastani et al. [2021] in the setting where contexts are common to all arms but coefficients are arm-specific. In the setting where contexts differ across arms but coefficients are common, ours is the first regret bound under the margin condition for linear models or GLMs. We conduct empirical studies on synthetic data and real examples, demonstrating the effectiveness of our algorithm.
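To make the DR construction concrete, the following is a minimal sketch of the standard doubly-robust pseudo-reward used in the contextual bandit literature, not the exact DDR estimator of this paper; the notation ($x_{t,i}$, $y_t$, $a_t$, $\pi_{t,i}$, $\hat{\beta}_t$) and the linear mean are illustrative assumptions. If arm $a_t$ is selected with known probability $\pi_{t,i}$ among candidate arms $i$ with contexts $x_{t,i}$, observed reward $y_t$, and current coefficient estimate $\hat{\beta}_t$, a pseudo-reward can be imputed for every arm $i$, selected or not, as
\[
\tilde{y}_{t,i} \;=\; x_{t,i}^{\top}\hat{\beta}_t \;+\; \frac{\mathbb{1}(a_t = i)}{\pi_{t,i}}\bigl(y_t - x_{t,i}^{\top}\hat{\beta}_t\bigr),
\]
which is conditionally unbiased for the mean reward of arm $i$ because the selection probability $\pi_{t,i}$ is known by design, regardless of the accuracy of the imputed value $x_{t,i}^{\top}\hat{\beta}_t$. This allows the estimator to use the contexts of all arms rather than only the chosen ones; the DDR estimator proposed here is a refinement of this construction with a tighter error bound.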