Dueling bandits are widely used to model preferential feedback that is prevalent in machine learning applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a new and highly expressive generalized linear dueling bandit model, which covers many existing models. Surprisingly, the Borda regret minimization problem turns out to be difficult: we prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$, where $d$ is the dimension of the contextual vectors and $T$ is the time horizon. To attain this lower bound, we propose an explore-then-commit type algorithm, which achieves a nearly matching regret upper bound of $\tilde{O}(d^{2/3} T^{2/3})$. When the number of items/arms $K$ is small, our algorithm can achieve a smaller regret of $\tilde{O}((d \log K)^{1/3} T^{2/3})$ with a proper choice of hyperparameters. We also conduct empirical experiments on both synthetic data and a simulated real-world environment, which corroborate our theoretical analysis.