Organizations increasingly adopt Retrieval-Augmented Generation (RAG) to enhance Large Language Models with enterprise-specific knowledge. However, current data quality (DQ) frameworks were developed primarily for static datasets and address the dynamic, multi-stage nature of RAG systems only inadequately. This study aims to develop DQ dimensions for this new type of AI-based system. We conduct 16 semi-structured interviews with practitioners at leading IT service companies. Through qualitative content analysis, we inductively derive 15 distinct DQ dimensions across the four processing stages of RAG systems: data extraction, data transformation, prompt & search, and generation. Our findings reveal that (1) new dimensions must be added to traditional DQ frameworks to cover RAG contexts; (2) these new dimensions are concentrated in the early RAG steps, suggesting the need for front-loaded quality management strategies; and (3) DQ issues transform and propagate through the RAG pipeline, necessitating a dynamic, step-aware approach to quality management.