We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons. Previous learning algorithms have focused on the $\textit{fully adaptive}$ setting, where the algorithm can make updates after every comparison. The "batched" dueling bandit problem is motivated by large-scale applications like web search ranking and recommendation systems, where performing sequential updates may be infeasible. In this work, we ask: $\textit{is there a solution using only a few adaptive rounds that matches the asymptotic regret bounds of the best sequential algorithms for $K$-armed dueling bandits?}$ We answer this in the affirmative $\textit{under the Condorcet condition}$, a standard setting of the $K$-armed dueling bandit problem. We obtain asymptotic regret of $O(K^2\log^2(K)) + O(K\log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon. Our regret bounds nearly match the best regret bounds known in the fully sequential setting under the Condorcet condition. Finally, in computational experiments over a variety of real-world datasets, we observe that our algorithm using $O(\log(T))$ rounds achieves almost the same performance as fully sequential algorithms (that use $T$ rounds).
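To make the round-limited interaction model concrete, below is a minimal Python sketch of a batched dueling-bandit loop: the learner commits to a whole batch of pairwise comparisons, observes all outcomes, and only then updates, with geometrically growing batches yielding $O(\log(T))$ adaptive rounds. The preference matrix `P`, the helpers `duel` and `batched_elimination`, the initial batch size, and the elimination rule are all hypothetical illustrations of the batched protocol, not the algorithm analyzed in this work.

```python
import math
import random

# Hypothetical simulator: P[i][j] = probability that arm i beats arm j in a duel.
# Arm 0 is the Condorcet winner here (it beats every other arm with prob > 1/2).
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]
K = len(P)

def duel(i, j):
    """One pairwise comparison; returns True if arm i beats arm j."""
    return random.random() < P[i][j]

def batched_elimination(T):
    """Illustrative batched scheme: O(log T) rounds of geometrically growing
    batches; arms whose empirical win rate against some rival is confidently
    below 1/2 are eliminated. A placeholder rule, not the paper's algorithm."""
    active = list(range(K))
    wins = [[0] * K for _ in range(K)]
    nums = [[0] * K for _ in range(K)]
    t, batch = 0, K * K  # initial batch size (hypothetical choice)
    while t < T and len(active) > 1:
        # One adaptive round: commit to an entire batch of comparisons.
        for _ in range(batch):
            if t >= T:
                break
            i, j = random.sample(active, 2)
            if duel(i, j):
                wins[i][j] += 1
            else:
                wins[j][i] += 1
            nums[i][j] += 1
            nums[j][i] += 1
            t += 1
        # Update only after the batch: drop arms beaten with confidence.
        for i in list(active):
            for j in active:
                n = nums[i][j]
                if i == j or n == 0:
                    continue
                radius = math.sqrt(math.log(max(T, 2)) / n)  # confidence radius
                if wins[i][j] / n + radius < 0.5 and i in active:
                    active.remove(i)
                    break
        batch *= 2  # geometric growth => O(log T) adaptive rounds

    return active

print(batched_elimination(T=20000))
```

Running the sketch typically leaves only arm 0 (the Condorcet winner) active; the point of the illustration is that the number of update points grows logarithmically in $T$, in contrast to fully adaptive algorithms that may update after every single comparison.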