Regret Minimization in Stochastic Contextual Dueling Bandits

We consider the stochastic $K$-armed dueling bandit problem in the contextual setting: at each round the learner is presented with a context set of $K$ items, each represented by a $d$-dimensional feature vector, and the learner's goal is to identify the best arm of each context set. Unlike the classical contextual bandit setup, our framework only allows the learner to receive feedback on the items in the form of (noisy) pairwise preferences. This feedback model, famously studied as dueling bandits, is of practical interest in various online decision-making scenarios, e.g. recommender systems, information retrieval, and tournament ranking, where it is easier to elicit the relative strength of items than their absolute scores. To the best of our knowledge, this work is the first to consider regret minimization for contextual dueling bandits over potentially infinite decision spaces, and it gives provably optimal algorithms along with a matching lower bound analysis. We present two algorithms for this setup, with regret guarantees $\tilde O(d\sqrt{T})$ and $\tilde O(\sqrt{dT \log K})$, respectively. We then show that $\Omega(\sqrt{dT})$ is the fundamental performance limit for this problem, implying the optimality of our second algorithm up to logarithmic factors. The analysis of our first algorithm is comparatively simpler, however, and it often outperforms the second algorithm empirically. Finally, we corroborate all the theoretical results with suitable experiments.
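To make the feedback model concrete, the following is a minimal sketch of one round of contextual dueling feedback. The abstract does not specify the preference model, so the linear score `theta @ x` and the logistic link are assumptions made purely for illustration, and the names `preference_prob` and `sample_duel` are hypothetical helpers rather than part of the paper's algorithms.

```python
import numpy as np

def preference_prob(theta, x_i, x_j):
    """Probability that item i beats item j.

    Assumption (not stated in the abstract): items have linear scores
    theta @ x, and the preference follows a logistic link on the score gap.
    """
    gap = float(theta @ (x_i - x_j))
    return 1.0 / (1.0 + np.exp(-gap))

def sample_duel(theta, contexts, i, j, rng):
    """Draw one noisy pairwise preference: True if item i wins the duel."""
    return bool(rng.random() < preference_prob(theta, contexts[i], contexts[j]))

rng = np.random.default_rng(0)
d, K = 5, 4                           # feature dimension, items per context set
theta = rng.normal(size=d)            # unknown to the learner
contexts = rng.normal(size=(K, d))    # one round's context set, one row per item

# Under the assumed model, the best arm is the item with the highest score.
best = int(np.argmax(contexts @ theta))

# The learner picks a pair (i, j) and only observes which of the two won.
i, j = 0, 1
i_wins = sample_duel(theta, contexts, i, j, rng)
```

The learner never sees the scores `contexts @ theta` themselves, only the binary outcome `i_wins`, which is what distinguishes this setting from a classical contextual bandit with absolute reward feedback.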
