Large language models (LLMs) show great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by the LLM: we show that, for problems with structured outputs, it is possible to prompt an LLM to perform the task in the opposite direction, to generate plausible text for the target structure. Leveraging the asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, demonstrate its superior quality compared to existing datasets in a human evaluation and use it to finetune small models (220M and 770M parameters). The models we introduce, SynthIE, outperform existing baselines of comparable size with a substantial gap of 57 and 79 absolute points in micro and macro F1, respectively. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.