Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits

We study the problem of \emph{dynamic regret minimization} in $K$-armed Dueling Bandits under non-stationary or time-varying preferences. This is an online learning setup where the agent chooses a pair of items at each round and observes only a relative binary `win-loss' feedback for this pair, sampled from an underlying preference matrix at that round. We first study the problem of static-regret minimization for adversarial preference sequences and design an efficient algorithm with $O(\sqrt{KT})$ high-probability regret. We next use similar algorithmic ideas to propose an efficient and provably optimal algorithm for dynamic-regret minimization under two notions of non-stationarity. In particular, we establish $\tilde{O}(\sqrt{SKT})$ and $\tilde{O}(V_T^{1/3}K^{1/3}T^{2/3})$ dynamic-regret guarantees, where $S$ is the total number of `effective switches' in the underlying preference relations and $V_T$ is a measure of `continuous-variation' non-stationarity. The complexity of these problems has not been studied prior to this work, despite the practical relevance of non-stationary environments in real-world systems. We justify the optimality of our algorithms by proving matching lower bounds under both of the above notions of non-stationarity. Finally, we corroborate our results with extensive simulations and compare the efficacy of our algorithms against state-of-the-art baselines.
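To make the feedback model concrete, the following is a minimal Python sketch of the interaction protocol only, not of the algorithms proposed in the paper. Everything in it is an illustrative assumption: the helper names (`duel`, `make_preference_matrix`, `simulate`), the preference gap of 0.2, the single switch point (so $S = 1$), and the uniform-random pair selection, which is merely a placeholder where a learning policy would go.

```python
import random


def duel(P, i, j, rng):
    """Binary win-loss feedback: 1 if arm i beats arm j, drawn with probability P[i][j]."""
    return 1 if rng.random() < P[i][j] else 0


def make_preference_matrix(best, K, gap=0.2):
    """Hypothetical K x K preference matrix: arm `best` beats every other arm
    with probability 0.5 + gap; all remaining pairs are ties (0.5)."""
    P = [[0.5] * K for _ in range(K)]
    for j in range(K):
        if j != best:
            P[best][j] = 0.5 + gap
            P[j][best] = 0.5 - gap
    return P


def simulate(T=1000, K=4, switch_at=500, seed=0):
    """Run T rounds against a preference matrix that switches once (assumed S = 1)."""
    rng = random.Random(seed)
    P_before = make_preference_matrix(0, K)      # arm 0 is best before the switch
    P_after = make_preference_matrix(K - 1, K)   # arm K-1 is best after the switch
    history = []
    for t in range(T):
        P_t = P_before if t < switch_at else P_after
        # Placeholder policy (assumption): pick a uniformly random pair of distinct arms.
        i, j = rng.sample(range(K), 2)
        outcome = duel(P_t, i, j, rng)
        history.append((t, i, j, outcome))
    return history


if __name__ == "__main__":
    rounds = simulate()
    print(len(rounds), "rounds simulated; first round:", rounds[0])
```

The learner sees only the `(i, j, outcome)` triples, never the entries of the preference matrix itself, which is what distinguishes this setting from a standard bandit with numeric rewards.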
