Neural ranking models (NRMs) have shown remarkable success in recent years, especially with pre-trained language models. However, deep neural models are notorious for their vulnerability to adversarial examples. Adversarial attacks may become a new type of web spamming technique given our increased reliance on neural information retrieval models. Therefore, it is important to study potential adversarial attacks to identify vulnerabilities of NRMs before they are deployed. In this paper, we introduce the Word Substitution Ranking Attack (WSRA) task against NRMs, which aims to promote a target document in rankings by adding adversarial perturbations to its text. We focus on the decision-based black-box attack setting, where the attackers have no access to the model parameters and gradients, but can only acquire the rank positions of a partially retrieved list by querying the target model. This attack setting is realistic for real-world search engines. We propose a novel Pseudo Relevance-based ADversarial ranking Attack method (PRADA) that learns a surrogate model based on Pseudo Relevance Feedback (PRF) to generate gradients for finding the adversarial perturbations. Experiments on two web search benchmark datasets show that PRADA can outperform existing attack strategies and successfully fool the NRM with small, indiscernible perturbations of text.