In Text-to-SQL generation, large language models (LLMs) have shown strong generalization and adaptability. However, LLMs sometimes generate hallucinations, i.e., unrealistic or illogical content, which leads to incorrect SQL queries and negatively impacts downstream applications. Detecting these hallucinations is particularly challenging: existing Text-to-SQL error detection methods, tailored for traditional deep learning models, face significant limitations when applied to LLMs, primarily because ground-truth data are scarce. To address this challenge, we propose SQLHD, a novel hallucination detection method based on metamorphic testing (MT) that does not require standard answers. SQLHD splits the detection task into two sequential stages: schema-linking hallucination detection via eight structure-aware Metamorphic Relations (MRs) that perturb comparative words, entities, sentence structure, or the database schema, and logical-synthesis hallucination detection via nine logic-aware MRs that mutate prefix words, extremum expressions, comparison ranges, or the entire database. In each stage the LLM is invoked separately to generate schema mappings or SQL artifacts; the follow-up outputs are cross-checked against their source counterparts through the corresponding MRs, and any violation is flagged as a hallucination without requiring ground-truth SQL. Experimental results demonstrate the effectiveness of our method, with F1-scores ranging from 69.36\% to 82.76\%. In addition, SQLHD outperforms LLM self-evaluation methods, effectively identifying hallucinations in Text-to-SQL tasks.
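To make the metamorphic-testing idea concrete, the sketch below illustrates one possible MR of the kind described above: perturbing a comparative word in the source question and checking that the follow-up SQL changes consistently, flagging a violation as a suspected hallucination. This is a minimal conceptual illustration only; the function and variable names (`apply_comparative_mr`, `detect_hallucination`, `generate_sql`) are hypothetical and do not reflect SQLHD's actual interface or its full set of seventeen MRs.

```python
# Conceptual sketch of a single metamorphic relation (MR) for Text-to-SQL
# hallucination detection. All names here are illustrative assumptions,
# not SQLHD's real implementation.

def apply_comparative_mr(question: str) -> tuple[str, str, str]:
    """Perturb a comparative word in the source question and return the
    follow-up question plus the SQL operators expected in the source and
    follow-up outputs."""
    if "more than" in question:
        return question.replace("more than", "less than"), ">", "<"
    if "less than" in question:
        return question.replace("less than", "more than"), "<", ">"
    raise ValueError("no comparative word found to perturb")

def detect_hallucination(question: str, generate_sql) -> bool:
    """Cross-check the source and follow-up SQL through the MR; a violated
    relation is flagged as a hallucination, with no ground-truth SQL needed."""
    follow_up, src_op, fol_op = apply_comparative_mr(question)
    src_sql = generate_sql(question)    # first LLM invocation
    fol_sql = generate_sql(follow_up)   # second LLM invocation
    # The MR: flipping the comparative word should flip the comparison
    # operator in the generated query.
    consistent = (src_op in src_sql) and (fol_op in fol_sql)
    return not consistent               # True -> hallucination suspected

if __name__ == "__main__":
    # Stub standing in for an LLM that ignores the perturbation,
    # so the MR is violated and a hallucination is reported.
    def fake_llm(q: str) -> str:
        return "SELECT name FROM employees WHERE salary > 50000"
    print(detect_hallucination("List employees earning more than 50000", fake_llm))  # True
```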