Optimizing Ranking Systems Online as Bandits

Ranking systems are a core part of modern retrieval and recommender systems, where the goal is to rank candidate items given user contexts. Optimizing a ranking system online means that the deployed system serves user requests, e.g., queries in web search, while optimizing its ranking policy by learning from user interactions, e.g., clicks. Bandits are a general online learning framework and can be applied to this optimization task. However, due to the unique features of ranking, designing bandit algorithms for ranking system optimization poses several challenges. In this dissertation, we study and propose solutions for four challenges in optimizing ranking systems online: effectiveness, safety, nonstationarity, and diversification. First, effectiveness concerns how fast the algorithm learns from interactions. We study the online ranker evaluation task and propose the MergeDTS algorithm to solve it effectively. Second, the deployed algorithm should be safe, meaning that it only displays reasonable content in response to user requests. To solve the safe online learning to rank problem, we propose the BubbleRank algorithm. Third, as users change their preferences constantly, the algorithm should handle nonstationarity. We formulate this nonstationary online learning to rank problem as cascade non-stationary bandits and propose the CascadeDUCB and CascadeSWUCB algorithms to solve it. Finally, the content in ranked lists should be diverse. We consider the results diversification task and propose the CascadeHybrid algorithm, which considers both item relevance and result diversification when learning from user interactions.
