Semi-supervised learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) has enabled effective optimization of compact neural networks, achieving the best results when the knowledge of an expensive network is distilled via fresh task-specific unlabeled data. However, task-specific unlabeled data can be challenging to find, especially for NLP. We investigate the use of generative models for synthesizing unlabeled data and present a simple and general framework called "generate, annotate, and learn (GAL)": a language model (LM) is used to synthesize in-domain unlabeled data, a classifier is then used to annotate that data, and the synthetically generated and annotated data is finally used to advance SSL, KD, and few-shot learning on NLP and tabular tasks. To obtain a strong task-specific LM, we either fine-tune a large LM on inputs from a specific task, or prompt a large LM with a few input examples and conditionally generate more unlabeled examples. GAL yields a new state-of-the-art for 6-layer transformers on the GLUE leaderboard, and self-training with GAL offers large gains on four tabular tasks from the UCI repository.
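As a concrete illustration of the three steps, the sketch below wires them together with off-the-shelf Hugging Face pipelines. It is only a minimal example under assumed placeholder choices (the gpt2 generator, the distilbert-base-uncased-finetuned-sst-2-english annotator, the prompt, and the sampling settings are illustrative), not the configuration or implementation used in the paper.

    # Illustrative sketch of the GAL loop (generate, annotate, learn) using
    # Hugging Face pipelines. Model names, prompt, and hyperparameters are
    # placeholders for illustration, not the paper's setup.
    from transformers import pipeline

    # Step 1 (generate): a task-specific LM synthesizes in-domain unlabeled inputs.
    generator = pipeline("text-generation", model="gpt2")  # placeholder LM
    outputs = generator(
        "The movie was",            # placeholder in-domain prompt
        max_new_tokens=30,
        num_return_sequences=8,
        do_sample=True,
    )
    synthetic_inputs = [o["generated_text"] for o in outputs]

    # Step 2 (annotate): a classifier trained on the labeled task data
    # pseudo-labels the synthetic inputs.
    classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder annotator
    )
    pseudo_labeled = [(x, classifier(x)[0]["label"]) for x in synthetic_inputs]

    # Step 3 (learn): the pseudo-labeled synthetic data would be mixed with the
    # original labeled set to train a (possibly compact) student model for
    # SSL, KD, or few-shot learning.
    for text, label in pseudo_labeled[:2]:
        print(label, "|", text)

In this sketch the student-training step is left abstract, since it depends on the downstream task; the essential point is that the synthetic inputs and their pseudo-labels play the role of the missing task-specific unlabeled data.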