Manually annotating datasets requires domain experts to read through many documents and label them carefully, which is often expensive. Recently, pre-trained generative language models (GLMs) have demonstrated exceptional text-generation abilities, which motivates leveraging them for generative data augmentation. We improve generative data augmentation by formulating data generation as a context generation task and using question answering (QA) datasets for intermediate training. Specifically, we view QA more as a format than as a task and train GLMs as context generators for a given question and its respective answer. Then, we cast downstream tasks into a question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLMs to generate relevant contexts, which serve as synthetic training data for their corresponding tasks. We perform extensive experiments, case studies, and ablation studies on multiple sentiment and topic classification datasets and demonstrate substantial performance improvements in both few-shot and zero-shot settings. Remarkably, on the SST-2 dataset, intermediate training on the SocialIQA dataset yields a 40% improvement in Macro-F1 score. Through thorough analyses, we observe that QA datasets that require high-level reasoning abilities (e.g., abstractive and commonsense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.
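As a minimal sketch of the "QA as a format" idea described above, the snippet below shows how a sentiment-classification task might be cast into a (question, answer) pair, and how a prompt for the context generator could be assembled. All function names and prompt templates here are illustrative assumptions, not the paper's actual implementation; the generated continuation of the prompt would serve as the synthetic training example.

```python
# Hypothetical sketch: casting an SST-2-style sentiment task into QA format
# so a fine-tuned GLM context generator can synthesize training data.
# Prompt template and function names are illustrative, not from the paper.

def sentiment_to_qa(label: str) -> tuple[str, str]:
    """Cast a binary sentiment label ("1" = positive) into a QA pair."""
    question = "What is the sentiment of this review?"
    answer = "positive" if label == "1" else "negative"
    return question, answer

def build_generation_prompt(question: str, answer: str) -> str:
    """Prompt asking the context generator for a context matching the pair."""
    return f"question: {question} answer: {answer} context:"

# Example: a positive label becomes a prompt whose GLM-generated
# continuation (the "context") is kept as a synthetic example
# labeled with `answer`.
q, a = sentiment_to_qa("1")
prompt = build_generation_prompt(q, a)
```

In this framing, the same context generator can be reused across classification tasks simply by swapping in a different question and answer vocabulary.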