The rapid progress of large language models (LLMs) is fueled by a growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, in particular underrepresenting long-tail knowledge due to truncation effects from data-generation mechanisms such as top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges for characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior, delimited by two breakpoints, that reflects transitions in model behavior as learning shifts from head to tail knowledge. We further derive an LLM generalization bound tailored to real-synthetic mixtures, revealing several key factors that govern their generalization performance. Building on these theoretical findings, we propose an effective and efficient data valuation method that scales to large datasets. Comprehensive experiments on four tasks, namely image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art data valuation baselines at significantly lower computational cost.