The ability of generative language models (GLMs) to generate text has improved considerably in recent years, enabling their use for generative data augmentation. In this work, we propose CONDA, an approach that further improves GLMs' ability to generate synthetic data by reformulating data generation as context generation for a given question-answer (QA) pair and leveraging QA datasets to train context generators. We then cast downstream tasks into the same question-answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the adapted generators to produce relevant contexts, which in turn serve as synthetic training data for their corresponding tasks. We perform extensive experiments on multiple classification datasets and demonstrate substantial performance improvements in both few-shot and zero-shot settings. Our analysis reveals that QA datasets requiring high-level reasoning abilities (e.g., abstractive and commonsense QA datasets) tend to yield the largest gains in both settings.
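To make the described pipeline concrete, here is a minimal sketch of the generation loop, assuming a context generator already fine-tuned on QA datasets. The prompt template, the "gpt2" stand-in checkpoint, and the label verbalizer are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of CONDA-style context generation: a classification
# task is cast as a QA pair, and a GLM generates contexts that become
# synthetic training examples for the downstream task.
from transformers import pipeline

# Assumed: a GLM fine-tuned on QA datasets to generate contexts for
# (question, answer) pairs; "gpt2" is only a placeholder checkpoint.
generator = pipeline("text-generation", model="gpt2")

# Cast the downstream task into QA format: the task prompt becomes the
# question and each verbalized label becomes an answer (assumed template).
question = "What is the sentiment of this review?"
labels = {"positive": 1, "negative": 0}

synthetic_data = []
for answer, label_id in labels.items():
    prompt = f"Question: {question}\nAnswer: {answer}\nContext:"
    outputs = generator(
        prompt,
        max_new_tokens=64,
        num_return_sequences=2,
        do_sample=True,
    )
    for out in outputs:
        # The generated context, paired with its label, is a synthetic
        # training example for the corresponding classification task.
        context = out["generated_text"][len(prompt):].strip()
        synthetic_data.append({"text": context, "label": label_id})

print(f"{len(synthetic_data)} synthetic examples generated")
```

The key design point this sketch illustrates is the reformulation: rather than asking the GLM to produce a labeled example directly, the label is fixed in the prompt as the answer, and the model only has to generate a context consistent with it.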