Data Quality Challenges in Retrieval-Augmented Generation

Preprint version, from arXiv. Accepted for presentation at the International Conference on Information Systems (ICIS 2025). Please cite the published version when available.

Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and address the dynamic, multi-stage nature of RAG systems only inadequately. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners at leading IT service companies. Through a qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions must be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.
