Are Neural Ranking Models Robust?

Recently, we have witnessed the bloom of neural ranking models in the information retrieval (IR) field. So far, much effort has been devoted to developing effective neural ranking models that generalize well to new data, while much less attention has been paid to robustness. Unlike effectiveness, which concerns the average performance of a system under normal conditions, robustness concerns system performance in the worst case or under malicious operations. When a new technique enters real-world applications, it is critical to know not only how it works on average, but also how it behaves in abnormal situations. We therefore raise the question in this work: Are neural ranking models robust? To answer this question, we first need to clarify what we mean by the robustness of ranking models in IR. We show that robustness is a multi-dimensional concept, and that there are three ways to define it in IR: 1) the performance variance under the independent and identically distributed (I.I.D.) setting; 2) the out-of-distribution (OOD) generalizability; and 3) the defensive ability against adversarial operations. The latter two definitions can each be further specified from two different perspectives, leading to 5 robustness tasks in total. Based on this taxonomy, we build corresponding benchmark datasets, design empirical experiments, and systematically analyze the robustness of several representative neural ranking models against traditional probabilistic ranking models and learning-to-rank (LTR) models. The empirical results show that there is no simple answer to our question. While neural ranking models are less robust than other IR models in most cases, some of them still win 1 out of 5 tasks. This is the first comprehensive study on the robustness of neural ranking models.
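To make the contrast between average effectiveness and robustness concrete, the following is a minimal Python sketch, not the paper's evaluation code. It assumes each ranker has already been scored per query with some effectiveness metric; the function name `summarize` and the score values are hypothetical, made up purely for illustration. The abstract does not state the exact formulas used for the robustness tasks, so the variance and worst-case statistics here are only one plausible reading of the I.I.D. definition.

```python
from statistics import mean, pvariance


def summarize(per_query_scores):
    """Return the mean, population variance, and minimum of per-query scores."""
    if not per_query_scores:
        raise ValueError("need at least one per-query score")
    return {
        "mean": mean(per_query_scores),          # average effectiveness
        "variance": pvariance(per_query_scores),  # spread across queries
        "worst": min(per_query_scores),           # worst-case query
    }


# Hypothetical per-query scores for two rankers (illustrative numbers only).
ranker_a = [0.50, 0.50, 0.50, 0.50]
ranker_b = [0.90, 0.90, 0.10, 0.10]

for name, scores in [("A", ranker_a), ("B", ranker_b)]:
    print(name, summarize(scores))
```

In this toy example both rankers have the same average score, yet ranker B varies far more across queries and has a much lower worst-case score, which is the kind of difference an average-only comparison would hide.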
