Heavily pre-trained transformers for language modelling, such as BERT, have been shown to be remarkably effective for Information Retrieval (IR) tasks, where they are typically applied to re-rank the results of a first-stage retrieval model. IR benchmarks evaluate the effectiveness of retrieval pipelines based on the premise that a single query is used to instantiate the underlying information need. However, previous research has shown that (I) queries generated by users for a fixed information need are highly variable and, in particular, (II) neural models are brittle and often make mistakes when tested with modified inputs. Motivated by these observations, we aim to answer the following question: how robust are retrieval pipelines with respect to different query variations that do not change the query's semantics? In order to obtain queries that are representative of users' querying variability, we first created a taxonomy based on the manual annotation of transformations occurring in a dataset (UQV100) of user-created query variations. For each syntax-changing category of our taxonomy, we employed different automatic methods that, when applied to a query, generate a query variation; a simple sketch of such a generator is shown below. Our experimental results across two datasets for two IR tasks reveal that retrieval pipelines are not robust to these query variations, with effectiveness drops of $\approx20\%$ on average. The code and datasets are available at https://github.com/Guzpenha/query_variation_generators.
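To make the notion of a syntax-changing query variation concrete, the following minimal Python sketch applies two illustrative transformations to a query: a neighboring-character swap within one word (simulating a misspelling) and a random word-order shuffle. These two transformations are assumptions chosen for illustration only; the actual generation methods used for each taxonomy category are in the linked repository.

```python
import random

def misspelling_variation(query: str, seed: int = 0) -> str:
    """Swap two neighboring characters inside one word to simulate a typo.

    Illustrative only: one plausible instance of a syntax-changing
    transformation, not the paper's exact method.
    """
    rng = random.Random(seed)
    words = query.split()
    # Only consider words long enough for a character swap to be visible.
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return query
    i = rng.choice(candidates)
    chars = list(words[i])
    j = rng.randrange(len(chars) - 1)
    chars[j], chars[j + 1] = chars[j + 1], chars[j]
    words[i] = "".join(chars)
    return " ".join(words)

def word_order_variation(query: str, seed: int = 0) -> str:
    """Shuffle word order: a syntax change that keeps the bag of words intact."""
    rng = random.Random(seed)
    words = query.split()
    rng.shuffle(words)
    return " ".join(words)

if __name__ == "__main__":
    q = "effects of climate change on coral reefs"
    print(misspelling_variation(q))  # e.g. "effects of climate chnage on coral reefs"
    print(word_order_variation(q))   # e.g. "coral of reefs climate change on effects"
```

Feeding both the original query and such variations through the same retrieval pipeline and comparing effectiveness metrics is the kind of robustness test the paper performs.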