Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, the evaluation and reward mechanisms used to train and assess text-to-SQL models remain a critical bottleneck. Current approaches rely heavily on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods for text-to-SQL use only the final binary execution outcome as the reward signal, a coarse-grained form of supervision that overlooks the detailed structural and semantic errors a rubric-level assessment would capture. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation that produces interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics without human annotation and links them to interpretable critiques. It then integrates densified reward feedback into RL training through a "progressive exploration" strategy that dynamically adjusts the rewards to improve model performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods on text-to-SQL evaluation, yielding significant performance gains.
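To make the contrast between the coarse binary execution reward and a rubric-based dense reward concrete, the sketch below shows one way such signals could be combined under a progressive weighting schedule. It is a minimal illustration, not the RuCo-C implementation: the rubric items, the `RubricScore` structure, and the `progressive_reward` schedule are hypothetical stand-ins for the paper's actual rubric generation and "progressive exploration" strategy.

```python
# Minimal sketch (illustrative, not the authors' implementation): blending a
# binary execution reward with a hypothetical rubric-based dense reward,
# where the rubric weight grows progressively over training steps.
from dataclasses import dataclass


@dataclass
class RubricScore:
    name: str       # e.g. "join conditions" -- illustrative rubric item
    weight: float   # relative importance of this rubric item
    score: float    # judge-assigned score in [0, 1]


def execution_reward(predicted_rows, gold_rows) -> float:
    """Coarse binary signal: 1.0 iff the predicted SQL's result set matches."""
    return 1.0 if predicted_rows == gold_rows else 0.0


def rubric_reward(rubric_scores: list[RubricScore]) -> float:
    """Dense signal: weighted average of per-rubric scores in [0, 1]."""
    total_weight = sum(r.weight for r in rubric_scores)
    if total_weight == 0:
        return 0.0
    return sum(r.weight * r.score for r in rubric_scores) / total_weight


def progressive_reward(exec_r: float, rubric_r: float,
                       step: int, total_steps: int,
                       max_mix: float = 0.5) -> float:
    """Hypothetical progressive schedule: start near the binary execution
    reward and gradually blend in the dense rubric-based reward."""
    mix = max_mix * min(step / max(total_steps, 1), 1.0)
    return (1.0 - mix) * exec_r + mix * rubric_r


if __name__ == "__main__":
    scores = [RubricScore("table selection", 1.0, 1.0),
              RubricScore("join conditions", 1.0, 0.5),
              RubricScore("aggregation", 0.5, 0.0)]
    r = progressive_reward(execution_reward([(1,)], [(2,)]),
                           rubric_reward(scores), step=300, total_steps=1000)
    print(f"blended reward: {r:.3f}")
```

The design point this sketch illustrates is that the dense rubric signal is blended in gradually, so early training stays anchored to execution correctness while later steps receive finer-grained feedback on structural and semantic errors.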