Large language models (LLMs) have transformed natural language processing (NLP), enabling applications from content generation to decision support. Retrieval-Augmented Generation (RAG) improves LLMs by incorporating external knowledge, but it also introduces security risks, particularly data poisoning, in which an attacker injects poisoned texts into the knowledge database to manipulate system outputs. Although various defenses have been proposed, they often struggle against advanced attacks. To address this, we introduce RAGuard, a detection framework designed to identify poisoned texts. RAGuard first expands the retrieval scope to raise the proportion of clean texts, diluting the influence of any poisoned content among the retrieved results. It then applies chunk-wise perplexity filtering to detect abnormal variations within a text, and text similarity filtering to flag highly similar retrieved texts. This non-parametric approach strengthens RAG security, and experiments on large-scale datasets demonstrate its effectiveness in detecting and mitigating poisoning attacks, including strong adaptive attacks.
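For intuition, the following is a minimal sketch of the two filtering steps described above, assuming a GPT-2 perplexity scorer, a sentence-transformers embedding model, and fixed-size token chunks; the model choices, chunk size, and thresholds are illustrative assumptions, not the configuration used in the paper.

```python
# Hedged sketch of chunk-wise perplexity filtering and text similarity filtering.
# All model names, chunk sizes, and thresholds below are assumed for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sentence_transformers import SentenceTransformer, util

ppl_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ppl_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def chunk_perplexities(text: str, chunk_size: int = 32) -> list[float]:
    """Split a retrieved text into fixed-size token chunks and score each chunk's perplexity."""
    ids = ppl_tokenizer(text, return_tensors="pt").input_ids[0]
    scores = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start : start + chunk_size].unsqueeze(0)
        if chunk.size(1) < 2:  # need at least two tokens for a next-token loss
            continue
        with torch.no_grad():
            loss = ppl_model(chunk, labels=chunk).loss
        scores.append(torch.exp(loss).item())
    return scores


def perplexity_flag(text: str, ppl_jump: float = 3.0) -> bool:
    """Flag a text whose chunk-wise perplexity varies abnormally (assumed ratio threshold)."""
    scores = chunk_perplexities(text)
    if len(scores) < 2:
        return False
    return max(scores) / (min(scores) + 1e-6) > ppl_jump


def similarity_flags(texts: list[str], sim_threshold: float = 0.9) -> set[int]:
    """Flag retrieved texts that are nearly identical to one another (assumed cosine threshold)."""
    emb = embedder.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(emb, emb)
    flagged = set()
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i, j].item() > sim_threshold:
                flagged.update({i, j})
    return flagged
```

In this sketch, a retrieved text is treated as suspicious if either filter fires; the expanded retrieval scope would simply mean running these checks over a larger candidate set before passing the surviving texts to the generator.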