Online learning to rank (OLTR) interactively learns to choose lists of items from a large collection based on certain click models that describe users' click behaviors. Most recent works on this problem focus on the stochastic environment, where item attractiveness is assumed to be invariant during the learning process. In many real-world scenarios, however, the environment can be dynamic or even arbitrarily changing. This work studies the OLTR problem in both stochastic and adversarial environments under the position-based model (PBM). We propose a method based on the follow-the-regularized-leader (FTRL) framework with Tsallis entropy and develop a new self-bounding constraint specifically designed for PBM. We prove that the proposed algorithm simultaneously achieves $O(\log{T})$ regret in the stochastic environment and $O(m\sqrt{nT})$ regret in the adversarial environment, where $T$ is the number of rounds, $n$ is the number of items, and $m$ is the number of positions. We also provide a lower bound of order $\Omega(m\sqrt{nT})$ for adversarial PBM, which matches our upper bound and improves over the state-of-the-art lower bound. Experiments show that our algorithm can simultaneously learn in both stochastic and adversarial environments and is competitive with existing methods designed for a single environment.
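For orientation, a minimal sketch of a generic FTRL update with $1/2$-Tsallis entropy regularization (the standard form from the Tsallis-INF literature) is
$$
x_{t} \;=\; \arg\min_{x \in \mathcal{X}} \;\Big\langle \textstyle\sum_{s=1}^{t-1} \widehat{\ell}_{s},\, x \Big\rangle \;+\; \frac{1}{\eta_t}\,\Psi(x),
\qquad
\Psi(x) \;=\; -\sum_{i=1}^{n} \sqrt{x_i},
$$
where $\widehat{\ell}_{s}$ is an importance-weighted estimate of the loss vector at round $s$, $\eta_t$ is a learning rate typically proportional to $1/\sqrt{t}$, and $\mathcal{X}$ is a convex relaxation of the set of assignments of $m$ positions to $n$ items. This is only an illustrative template under standard assumptions; the exact decision set, loss estimators, learning-rate schedule, and the self-bounding constraint tailored to PBM are specified in the main text.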