We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only a limited amount of unannotated real data. This addresses a critical open challenge in settings where real data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that select synthetic training data to maximize task performance on real data. The first of these proxies adapt distribution- and diversity-based distance measures to our setting via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose LENS, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with LENS consistently outperforming the other proxies on complex tasks by capturing nuanced characteristics of the data. For instance, on Text2SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies raises accuracy from 30.4% to 38.4% (+8.1 points) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection.
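As a rough illustration of the embedding-based proxy idea described in the abstract, the sketch below ranks candidate synthetic datasets by how close their mean embedding lies to that of a small unlabeled real sample, then keeps the top-k. This is a minimal sketch under stated assumptions, not the paper's actual metrics: the centroid distance is a stand-in for the distribution-based measures, the function names (`centroid_distance`, `rank_synthetic_datasets`, `toy_embeddings`) are hypothetical, and the embeddings are random toy vectors rather than outputs of a real embedding model.

```python
import math
import random
from statistics import fmean


def centroid(embeddings):
    """Per-dimension mean of a list of equal-length embedding vectors."""
    return [fmean(dim) for dim in zip(*embeddings)]


def centroid_distance(synthetic_emb, real_emb):
    """Euclidean distance between mean embeddings.

    Assumption: a deliberately simple stand-in for a distribution-based
    proxy; smaller means the synthetic data looks closer to the real data.
    """
    return math.dist(centroid(synthetic_emb), centroid(real_emb))


def rank_synthetic_datasets(candidates, real_emb, top_k=3):
    """Return the names of the top_k candidates, smallest distance first."""
    scores = {
        name: centroid_distance(emb, real_emb)
        for name, emb in candidates.items()
    }
    return sorted(scores, key=scores.get)[:top_k]


def toy_embeddings(rng, shift, n=200, dim=16):
    """Hypothetical toy data: Gaussian vectors centered at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real_emb = toy_embeddings(rng, shift=0.0)  # unlabeled real sample
candidates = {
    "synthetic_far": toy_embeddings(rng, shift=3.0),
    "synthetic_near": toy_embeddings(rng, shift=0.1),
    "synthetic_mid": toy_embeddings(rng, shift=1.0),
}

ranking = rank_synthetic_datasets(candidates, real_emb, top_k=3)
print(ranking)
```

In this toy setup the candidate whose embeddings are shifted least from the real sample is ranked first. No labels on the real data are needed, which matches the unannotated-data setting of SynQuE. A practical proxy would replace the toy vectors with embeddings from an actual model and could swap in a richer distance or diversity measure.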