We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments, each of which is associated with its own preference matrix. The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback. To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown. We propose the $\mathrm{Beat\, the\, Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-the-art algorithms. We also show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$. We further propose and analyze two meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\, Dueling\, Bandits}$ for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black box. Finally, we prove a worst-case lower bound on the expected weak regret in the non-stationary case.
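To make the interaction protocol concrete, the following is a minimal Python sketch of the piecewise-stationary dueling bandit environment described above. It is illustrative only: the learner interface (`select_pair`, `update`), the helper names `condorcet_winner` and `run_piecewise_stationary`, and the Bernoulli feedback model are assumptions for exposition, not constructs from the paper. Binary weak regret is charged as $1$ in a round unless one of the two chosen arms is the current segment's Condorcet winner.

```python
import numpy as np

def condorcet_winner(P):
    """Index of the arm beating every other arm with prob. > 1/2, or None."""
    K = P.shape[0]
    for i in range(K):
        if all(P[i, j] > 0.5 for j in range(K) if j != i):
            return i
    return None  # no Condorcet winner exists in this segment

def run_piecewise_stationary(learner, P_segments, segment_lengths, seed=0):
    """Simulate a learner on M stationary segments (hypothetical interface:
    learner.select_pair() -> (i, j), learner.update(i, j, i_won)) and
    accumulate binary weak regret against each segment's Condorcet winner."""
    rng = np.random.default_rng(seed)
    regret = 0
    for P, length in zip(P_segments, segment_lengths):
        cw = condorcet_winner(P)  # unknown to the learner
        for _ in range(length):
            i, j = learner.select_pair()       # learner picks a duel
            i_won = rng.random() < P[i, j]     # binary preference feedback
            learner.update(i, j, i_won)
            regret += 0 if cw in (i, j) else 1 # binary weak regret
    return regret
```

Note that the preference matrices `P_segments` and the segment lengths drive the simulation but are hidden from the learner, matching the setting in which $M$, $T$, and the change points are all unknown.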