We study the $K$-armed dueling bandit problem in both stochastic and adversarial environments, where the learner's goal is to aggregate information through relative preferences over pairs of decision points queried in an online, sequential manner. We first propose a novel reduction from any (general) dueling bandit problem to standard multi-armed bandits; despite its simplicity, it allows us to improve many existing results in dueling bandits. In particular, \emph{we give the first best-of-both-worlds result for the dueling bandit regret minimization problem}: a unified framework that is guaranteed to perform optimally under both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal $O(\sum_{i = 1}^K \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark, which scales optimally in both the number of arms $K$ and the instance-specific suboptimality gaps $\{\Delta_i\}_{i = 1}^K$. This resolves the long-standing open problem of designing an instance-wise, gap-dependent, order-optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving an optimal regret rate under adversarially corrupted preferences, which improves upon the existing state-of-the-art corrupted dueling bandit results by a large margin. In summary, we believe our reduction idea will find broader applications across a diverse class of dueling bandit settings that are otherwise studied separately from multi-armed bandits, often with more complex solutions and worse guarantees. The efficacy of our proposed algorithms is empirically corroborated against existing dueling bandit methods.
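The abstract states that a black-box reduction from dueling bandits to multi-armed bandits exists, but does not spell it out. A natural instantiation of such a reduction (a sketch under assumptions, not the paper's exact algorithm) runs two independent MAB learners, one choosing the left arm of each duel and one the right, and rewards each learner with the indicator that its arm won. The toy preference model, the choice of EXP3 as the base learner, and all names below are illustrative assumptions.

```python
import math
import random

class Exp3:
    """Standard EXP3 multi-armed bandit learner (adversarial rewards)."""
    def __init__(self, n_arms, eta):
        self.n = n_arms
        self.eta = eta
        self.log_w = [0.0] * n_arms  # log-weights for numerical stability

    def probs(self):
        m = max(self.log_w)
        w = [math.exp(lw - m) for lw in self.log_w]
        s = sum(w)
        return [x / s for x in w]

    def select(self):
        p = self.probs()
        arm = random.choices(range(self.n), weights=p)[0]
        return arm, p

    def update(self, arm, reward, p):
        # importance-weighted reward estimate for the played arm only
        self.log_w[arm] += self.eta * reward / p[arm]

def duel(i, j):
    # hypothetical preference oracle: returns 1 iff arm i beats arm j;
    # a toy stochastic model where a larger score means a better arm
    scores = [0.9, 0.6, 0.5, 0.2]
    p_i_wins = 0.5 + 0.5 * (scores[i] - scores[j])
    return 1 if random.random() < p_i_wins else 0

K, T = 4, 2000
eta = math.sqrt(math.log(K) / (K * T))
left, right = Exp3(K, eta), Exp3(K, eta)
wins = [0] * K
for t in range(T):
    i, p_l = left.select()   # left learner proposes one side of the duel
    j, p_r = right.select()  # right learner proposes the other side
    o = duel(i, j)           # 1 iff the left arm wins the preference query
    left.update(i, o, p_l)       # each learner is rewarded when its arm wins
    right.update(j, 1 - o, p_r)
    wins[i] += o
```

Because each learner sees only a bounded per-round reward (the win indicator), any regret guarantee of the base MAB algorithm transfers to the dueling objective; this is the sense in which existing MAB machinery can be reused rather than redesigned.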