Existing online learning to rank (OL2R) solutions are limited to linear models, which are unable to capture possible non-linear relations between queries and documents. In this work, to unleash the power of representation learning in OL2R, we propose to directly learn a neural ranking model from users' implicit feedback (e.g., clicks) collected on the fly. We focus on RankNet and LambdaRank, due to their great empirical success and wide adoption in offline settings, and control the notorious explore-exploit trade-off based on the convergence analysis of neural networks using the neural tangent kernel. Specifically, in each round of result serving, exploration is only performed on document pairs where the predicted rank order between the two documents is uncertain; otherwise, the ranker's predicted order is followed in result ranking. We prove that under standard assumptions our OL2R solution achieves a gap-dependent upper regret bound of $O(\log^2(T))$, where the regret is defined on the total number of mis-ordered pairs over $T$ rounds. Comparisons against an extensive set of state-of-the-art OL2R baselines on two public learning to rank benchmark datasets demonstrate the effectiveness of the proposed solution.
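The pairwise exploration strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a per-document confidence width (`widths`) stands in for the NTK-based pairwise uncertainty bound, and the adjacent-swap pass is a simplified placeholder for the paper's ranking over certain and uncertain pair orders.

```python
import numpy as np

def rank_with_pairwise_exploration(scores, widths, rng):
    """Order documents by predicted score, but randomize pairs whose
    predicted order is uncertain.

    scores : per-document scores from the (neural) ranker.
    widths : hypothetical per-document confidence widths standing in
             for the NTK-based uncertainty bound in the paper.
    """
    n = len(scores)
    # Exploit: start from the ranker's predicted order (descending score).
    order = list(np.argsort(-np.asarray(scores)))
    # Explore: for each adjacent pair, if the score gap is within the
    # combined uncertainty width, the predicted order is uncertain,
    # so flip a fair coin on whether to swap the pair.
    for k in range(n - 1):
        i, j = order[k], order[k + 1]
        gap = scores[i] - scores[j]
        if abs(gap) <= widths[i] + widths[j] and rng.random() < 0.5:
            order[k], order[k + 1] = j, i
    return order
```

With all widths set to zero, every pairwise order is treated as certain and the function reduces to pure exploitation; as uncertainty grows, more pairs are randomized, which is how the explore-exploit trade-off is confined to genuinely ambiguous comparisons.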