The $K$-armed dueling bandits problem, where the feedback is in the form of noisy pairwise preferences, has been widely studied due to its applications in information retrieval, recommendation systems, etc. Motivated by concerns that user preferences/tastes can evolve over time, we consider the problem of dueling bandits with distribution shifts. Specifically, we study the recent notion of significant shifts (Suk and Kpotufe, 2022), and ask whether one can design an adaptive algorithm for the dueling problem with $O(\sqrt{K\tilde{L}T})$ dynamic regret, where $\tilde{L}$ is the (unknown) number of significant shifts in preferences. We show that the answer to this question depends on the properties of the underlying preference distributions. First, we give an impossibility result that rules out any algorithm with $O(\sqrt{K\tilde{L}T})$ dynamic regret under the well-studied Condorcet and SST classes of preference distributions. Second, we show that $\text{SST} \cap \text{STI}$ is the largest among popular classes of preference distributions for which it is possible to design such an algorithm. Overall, our results provide an almost complete resolution of the above question for the hierarchy of distribution classes.