A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm and the rewards of other arms remain missing. The dependence of the arm choice on the past context and reward pairs compounds the complexity of regret analysis. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling employing the doubly-robust estimator used in missing data literature to Thompson Sampling with contexts (\texttt{LinTS}). Different from previous works relying on missing data techniques (\citet{dimakopoulou2019balanced}, \citet{kim2019doubly}), the proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound with the order of $\tilde{O}(\phi^{-2}\sqrt{T})$, where $\phi^2$ is the minimum eigenvalue of the covariance matrix of contexts. This is the first regret bound of \texttt{LinTS} using $\phi^2$ without the dimension of the context, $d$. Applying the relationship between $\phi^2$ and $d$, the regret bound of the proposed algorithm is $\tilde{O}(d\sqrt{T})$ in many practical scenarios, improving the bound of \texttt{LinTS} by a factor of $\sqrt{d}$. A benefit of the proposed method is that it utilizes all the context data, chosen or not chosen, thus allowing to circumvent the technical definition of unsaturated arms used in theoretical analysis of \texttt{LinTS}. Empirical studies show the advantage of the proposed algorithm over \texttt{LinTS}.
翻译:线性回报的双重稳健 Thompson 抽样
翻译后的摘要:
针对赌徒问题的一个挑战性方面是,只有所选择的手臂的随机奖励得到观测,而其他手臂的奖励则保持缺失。手臂选择对于过去上下文和奖励对的依赖增加了后悔分析的复杂性。我们提出了一种新型的多臂上下文赌徒算法 Doubly Robust(DR)Thompson Sampling,采用了用于缺失数据文献中使用的双重稳健估计器,用于有上下文的 Thompson Sampling。与先前依赖于缺失数据技术的工作不同(Dimakopoulou et al.,2019;Kim et al.,2019),所提出的算法旨在允许一种新的可加性后悔分解,从而导致改进的后悔界限,其顺序为 $\tilde{O}(\phi^{-2}\sqrt{T})$,其中 $\phi^2$ 是上下文协方差矩阵的最小特征值。这是第一个使用 $\phi^2$ 而非上下文维数 $d$ 的 \texttt{LinTS} 后悔界线。应用 $\phi^2$ 和 $d$ 之间的关系,所提出的算法的后悔界线在许多实际情况下为 $\tilde{O}(d\sqrt{T})$,将 \texttt{LinTS} 的界限提高了 $\sqrt{d}$ 倍。该方法的一个好处是它利用了所有上下文数据,无论是否选择,从而允许规避理论分析中 \texttt{LinTS} 未饱和手臂的技术定义。实证研究显示了所提出算法的优势。