The $K$-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have only focused on the sequential setting where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched $K$-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly-matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.