Text-to-SQL, the task of translating natural language questions into SQL queries, has advanced significantly with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can produce correct answers given multiple attempts. However, these methods rely on surface-level heuristics, selecting either a query that executes without errors through execution-based BoN (ex-BoN) or the most frequently generated query through Maj. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on their semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could also serve as an effective new test-time heuristic, although their application in this context remains largely underexplored. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and assess their effectiveness as a test-time heuristic within the BoN strategy. We benchmark ORMs against ex-BoN and Maj on the BIRD and Spider datasets, fine-tuning diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families. Results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that fine-tuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Finally, we observe that ORMs achieve competitive results on simple queries and benefit more than ex-BoN and Maj from an increased number of candidates.
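To make the contrast between the three test-time heuristics concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes SQLite databases and a hypothetical `orm_score(question, sql)` callable standing in for the fine-tuned reward model; only the selection logic is shown.

```python
# Sketch of the three candidate-selection heuristics compared in this work:
# execution-based Best-of-N (ex-BoN), Majority Voting (Maj), and ORM-based BoN.
import sqlite3
from collections import Counter
from typing import Callable, Optional

def execute(db_path: str, sql: str) -> Optional[tuple]:
    """Run a candidate query; return its result set as a hashable tuple, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def ex_bon(db_path: str, candidates: list[str]) -> str:
    """ex-BoN: keep the first candidate that executes without errors."""
    for sql in candidates:
        if execute(db_path, sql) is not None:
            return sql
    return candidates[0]  # fall back if no candidate executes

def majority_vote(db_path: str, candidates: list[str]) -> str:
    """Maj: pick a candidate whose execution result occurs most often."""
    results = {sql: execute(db_path, sql) for sql in candidates}
    counts = Counter(r for r in results.values() if r is not None)
    if not counts:
        return candidates[0]
    top_result, _ = counts.most_common(1)[0]
    return next(sql for sql, r in results.items() if r == top_result)

def orm_bon(question: str, candidates: list[str],
            orm_score: Callable[[str, str], float]) -> str:
    """ORM-based BoN: rank candidates by a learned semantic-correctness score."""
    return max(candidates, key=lambda sql: orm_score(question, sql))
```

The sketch highlights the difference in supervision signal: ex-BoN and Maj rely only on execution behavior of the candidates, whereas ORM-based BoN scores each candidate against the question with a learned model.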