Abstractive compression uses smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often contain information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This suggests that abstractive compressors are prone to omitting information essential to the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we apply offline data augmentation to the training dataset to make the compressor robust against two distinct types of retrieval noise. Second, since a language-model-based compressor cannot fully exploit information spread across multiple retrieved documents and exhibits positional bias, we fine-tune it to generate summaries centered on the key information that directly supports the correct answer. Our experiments demonstrate that T5-large trained with ACoRN as a compressor improves EM and F1 scores while preserving the answer string, which can serve as direct evidence. ACoRN excels on datasets containing many accuracy-reducing documents, making it highly useful in real-world scenarios.
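To make the two training steps concrete, below is a minimal, hypothetical Python sketch of the offline data-augmentation idea the abstract describes: retrieved documents are labeled in a fine-grained way (evidence vs. two noise types, irrelevant and misleading), and training examples are rebuilt with injected noise and shuffled positions so the compressor cannot rely on positional cues. All names here (`Doc`, `categorize`, `augment_example`, the heuristics) are illustrative assumptions, not ACoRN's actual implementation or API.

```python
# Hypothetical sketch of fine-grained document categorization and offline
# data augmentation for noise-robust compressor training. The labeling
# heuristics below are illustrative assumptions, not the paper's method.
from dataclasses import dataclass
import random


@dataclass
class Doc:
    text: str
    relevance: float  # retriever score


def categorize(doc: Doc, query_terms: set[str], answer: str) -> str:
    """Label a retrieved document as evidence or one of two noise types."""
    if answer.lower() in doc.text.lower():
        return "evidence"      # contains the answer string (direct support)
    if query_terms & set(doc.text.lower().split()):
        return "misleading"    # on-topic but lacks or contradicts the answer
    return "irrelevant"        # off-topic despite a high relevance score


def augment_example(docs: list[Doc], noise_pool: list[Doc],
                    k_noise: int = 2, seed: int = 0) -> list[Doc]:
    """Inject sampled noise documents and shuffle positions, so the
    compressor sees both noise types and no fixed evidence position."""
    rng = random.Random(seed)
    augmented = docs + rng.sample(noise_pool, k_noise)
    rng.shuffle(augmented)
    return augmented


# Usage: build one augmented training instance for a toy QA example.
evidence = [Doc("Paris is the capital of France.", 0.91)]
noise = [Doc("Lyon is famous for its cuisine.", 0.88),
         Doc("Berlin is the capital of Germany.", 0.85),
         Doc("The Nile flows through Egypt.", 0.40)]
answer = "Paris"
q_terms = {"capital", "france"}  # content terms of the query

for d in augment_example(evidence, noise):
    print(categorize(d, q_terms, answer), "|", d.text)
```

Under these assumptions, the fine-tuning target would then be a summary built around the documents labeled `evidence`, which is what lets the trained compressor preserve the answer string even when noise documents dominate the input.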