Enterprise ERP systems that manage hundreds of thousands of employee records suffer from serious data quality problems when human resources departments enter data manually, in a decentralized fashion, and in multiple languages. We present an end-to-end pipeline that combines automated data cleaning with LLM-driven SQL query generation, deployed for six months on a production system managing 240,000 employee records. The system operates in two integrated stages: a multi-stage cleaning pipeline that performs translation normalization, spelling correction, and entity deduplication during periodic synchronization from Microsoft SQL Server to PostgreSQL; and a retrieval-augmented generation framework powered by GPT-4o that translates natural-language questions in Turkish, Russian, and English into validated SQL queries. The query engine uses LangChain orchestration, FAISS vector similarity search, and few-shot learning with 500+ validated examples. On 2,847 production queries, our evaluation shows 92.5% query validity, 95.1% schema compliance, and 90.7% semantic accuracy. The system reduces query turnaround time from 2.3 days to under 5 seconds while maintaining 99.2% uptime, and GPT-4o achieves 46% lower latency and 68% lower cost than GPT-3.5. This modular architecture provides a reproducible framework for AI-native enterprise data governance and demonstrates real-world viability at enterprise scale, with a user satisfaction score of 4.3/5.0.
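To make the few-shot retrieval step concrete, the sketch below shows how validated question/SQL pairs might be ranked by similarity to an incoming question and assembled into a prompt. It is an illustration under stated assumptions, not the deployed implementation: a toy bag-of-words cosine similarity stands in for the FAISS index and learned embeddings, and the example pairs, table and column names, and the functions `embed`, `cosine`, `top_k_examples`, and `build_prompt` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical validated (question, SQL) pairs; table and column names are invented.
EXAMPLES = [
    ("How many employees are in each department?",
     "SELECT department, COUNT(*) FROM employees GROUP BY department;"),
    ("List employees hired after 2020",
     "SELECT full_name FROM employees WHERE hire_date > '2020-12-31';"),
    ("What is the average salary by job title?",
     "SELECT job_title, AVG(salary) FROM employees GROUP BY job_title;"),
]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k_examples(question, k=2):
    """Return the k validated examples most similar to the question."""
    q = embed(question)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(question, schema, k=2):
    """Assemble a few-shot prompt: schema, retrieved examples, then the question."""
    shots = "\n\n".join(
        f"Question: {q}\nSQL: {s}" for q, s in top_k_examples(question, k)
    )
    return f"Schema:\n{schema}\n\n{shots}\n\nQuestion: {question}\nSQL:"


if __name__ == "__main__":
    print(build_prompt(
        "How many employees work in each department?",
        "employees(full_name, department, job_title, salary, hire_date)",
    ))
```

In a production setting the sorted scan would be replaced by a vector index lookup, and the generated SQL would still need validation against the schema before execution.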