Text-to-SQL, the task of translating natural language questions into SQL queries, has advanced significantly with the introduction of Large Language Models (LLMs), broadening database accessibility for a wide range of users. Despite substantial progress in generating valid SQL, current LLMs still struggle with complex queries. To address this limitation, test-time strategies such as Best-of-N (BoN) and Majority Voting (Maj) are often employed, based on the assumption that LLMs can produce correct answers given multiple attempts. However, these methods rely on surface-level heuristics, selecting either a query that executes without errors through execution-based BoN (ex-BoN) or the most frequently generated query through Maj. Recently, Outcome Reward Models (ORMs), which assign utility scores to generated outputs based on their semantic correctness, have emerged as a promising reinforcement learning approach for improving model alignment. We argue that ORMs could also serve as an effective new test-time heuristic, although their application in this context remains largely underexplored. In this work, we propose a unified framework for training ORMs tailored to the Text-to-SQL task and assess their effectiveness as a test-time heuristic within the BoN strategy. We benchmark ORMs against ex-BoN and Maj on the BIRD and Spider datasets, fine-tuning diverse open-source LLMs from the Qwen2, Granite3, and Llama3 families. Results show that ORMs outperform ex-BoN and Maj, achieving execution accuracy gains of +4.33% (BIRD) and +2.10% (Spider) over ex-BoN, and +2.91% (BIRD) and +0.93% (Spider) over Maj. We further demonstrate that fine-tuning models already aligned with SQL generation, such as OmniSQL, yields superior ORM performance. Finally, we observe that ORMs achieve competitive results on simple queries and benefit more than ex-BoN and Maj from an increased number of candidates.
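To make the contrast between the three test-time heuristics concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes SQLite databases and a hypothetical `orm_score(question, sql)` callable standing in for the fine-tuned reward model; only the selection logic is shown.

```python
# Sketch of the three candidate-selection heuristics compared in this work:
# execution-based Best-of-N (ex-BoN), Majority Voting (Maj), and ORM-based BoN.
import sqlite3
from collections import Counter
from typing import Callable, Optional

def execute(db_path: str, sql: str) -> Optional[tuple]:
    """Run a candidate query; return its result set as a hashable tuple, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def ex_bon(db_path: str, candidates: list[str]) -> str:
    """ex-BoN: keep the first candidate that executes without errors."""
    for sql in candidates:
        if execute(db_path, sql) is not None:
            return sql
    return candidates[0]  # fall back if no candidate executes

def majority_vote(db_path: str, candidates: list[str]) -> str:
    """Maj: pick a candidate whose execution result occurs most often."""
    results = {sql: execute(db_path, sql) for sql in candidates}
    counts = Counter(r for r in results.values() if r is not None)
    if not counts:
        return candidates[0]
    top_result, _ = counts.most_common(1)[0]
    return next(sql for sql, r in results.items() if r == top_result)

def orm_bon(question: str, candidates: list[str],
            orm_score: Callable[[str, str], float]) -> str:
    """ORM-based BoN: rank candidates by a learned semantic-correctness score."""
    return max(candidates, key=lambda sql: orm_score(question, sql))
```

The sketch highlights the difference in supervision signal: ex-BoN and Maj rely only on execution behavior of the candidates, whereas ORM-based BoN scores each candidate against the question with a learned model.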