Retrieval-Augmented Generation (RAG) enriches LLMs by dynamically retrieving external knowledge, reducing hallucinations and satisfying real-time information needs. While existing research mainly targets RAG's performance and efficiency, emerging studies highlight critical security concerns. Yet current adversarial approaches remain limited, mostly addressing white-box scenarios or heuristic black-box attacks without fully investigating vulnerabilities in the retrieval phase. Moreover, because prior works focus mainly on factoid Q&A tasks, their attacks lack complexity and can be easily corrected by advanced LLMs. In this paper, we investigate a more realistic and critical threat scenario: adversarial attacks intended for opinion manipulation against black-box RAG models, particularly on controversial topics. Specifically, we propose FlippedRAG, a transfer-based adversarial attack against black-box RAG systems. We first demonstrate that the underlying retriever of a black-box RAG system can be reverse-engineered, enabling us to train a surrogate retriever. Leveraging the surrogate retriever, we then craft targeted poisoning triggers, altering very few documents to effectively manipulate both retrieval and the subsequent generation. Extensive empirical results show that FlippedRAG substantially outperforms baseline methods, improving the average attack success rate by 16.7%. FlippedRAG achieves on average a 50% directional shift in the opinion polarity of RAG-generated responses, ultimately causing a notable 20% shift in user cognition. Furthermore, we evaluate several potential defensive measures and conclude that existing mitigation strategies remain insufficient against such sophisticated manipulation attacks. These results highlight an urgent need for innovative defensive solutions to ensure the security and trustworthiness of RAG systems.