Users' clicks on Web search results are one of the key signals for evaluating and improving Web search quality and are widely used in current state-of-the-art Learning-To-Rank (LTR) models. With large volumes of search logs available to major search engines, effective models of searcher click behavior have emerged to evaluate and train LTR models. However, modeling users' click behavior requires accounting for the biases in that behavior. In particular, a search result that is not clicked has not necessarily been judged irrelevant by the user; it may simply have been missed, especially if it is ranked lower. Such biases in the click log data can be absorbed into the click models, propagating errors to the resulting LTR ranking models or evaluation metrics. In this paper, we propose the De-biased Reinforcement Learning Click model (DRLC). The DRLC model relaxes previously made assumptions about users' examination behavior and the resulting latent states. To implement the DRLC model, convolutional neural networks are used as the value networks for reinforcement learning, trained to learn a policy that reduces bias in the click logs. To demonstrate the effectiveness of the DRLC model, we first compare its performance with previous state-of-the-art approaches using established click prediction metrics, including log-likelihood and perplexity. We further show that DRLC also leads to improvements in ranking performance. Our experiments demonstrate that the DRLC model learns to reduce bias in click logs, leading to improved modeling performance and showing its potential for improving Web search quality.
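The abstract names log-likelihood and perplexity as the click prediction metrics but does not define them. As a rough illustration only, the sketch below uses the definitions common in the click-model literature: a mean Bernoulli log-likelihood over all results, and a per-rank perplexity based on base-2 logarithms. The function names, the session layout (one list of 0/1 clicks and one list of predicted click probabilities per query session), and the clipping constant `eps` are assumptions made for this example, not details taken from the DRLC paper.

```python
import math


def _clip(p, eps=1e-12):
    # Keep probabilities away from 0 and 1 so the logarithms stay finite.
    return min(max(p, eps), 1.0 - eps)


def log_likelihood(clicks, probs):
    """Mean per-result Bernoulli log-likelihood (natural log).

    clicks: list of sessions, each a list of 0/1 click indicators.
    probs:  same shape, predicted click probability for each result.
    (Simplified: each result is scored independently.)
    """
    total, count = 0.0, 0
    for session_clicks, session_probs in zip(clicks, probs):
        for c, p in zip(session_clicks, session_probs):
            p = _clip(p)
            total += math.log(p) if c else math.log(1.0 - p)
            count += 1
    return total / count


def perplexity_at_rank(clicks, probs, rank):
    """Click perplexity at a 0-based rank, averaged over sessions."""
    total = 0.0
    for session_clicks, session_probs in zip(clicks, probs):
        c, p = session_clicks[rank], _clip(session_probs[rank])
        total += math.log2(p) if c else math.log2(1.0 - p)
    return 2.0 ** (-total / len(clicks))


if __name__ == "__main__":
    # Two hypothetical sessions with two results each.
    clicks = [[1, 0], [0, 0]]
    probs = [[0.7, 0.2], [0.4, 0.1]]
    print("log-likelihood:", log_likelihood(clicks, probs))
    print("perplexity@1:", perplexity_at_rank(clicks, probs, rank=0))
```

Under these definitions a perplexity of 1 corresponds to perfect prediction and a perplexity of 2 to a model that always predicts a click probability of 0.5, so lower is better. For log-likelihood, values closer to 0 are better.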