This paper explores zero-label learning in Natural Language Processing (NLP), whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. At the core of our framework is a novel approach for better leveraging powerful pretrained language models. Specifically, inspired by the recent success of few-shot inference with GPT-3, we present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations. Our method enables zero-label learning: we train task-specific models solely on the synthetic data, yet achieve results better than or comparable to strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, our approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
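To make the UDG idea concrete, the following is a minimal sketch of few-shot prompted data synthesis for a sentiment classification task. The prompt wording, the `generate` callable (a stand-in for any large pretrained language model's text-completion interface), and all function names are illustrative assumptions, not the paper's exact templates or implementation.

```python
def build_udg_prompt(unlabeled_examples, target_label):
    """Assemble a few-shot prompt that conditions the LM on unlabeled
    in-domain text and asks it to write a new input for `target_label`.
    Note the in-context examples carry no labels (zero-label setting)."""
    header = "Write a movie review with the given sentiment.\n\n"
    shots = "".join(f"Review: {text}\n\n" for text in unlabeled_examples)
    query = f"Sentiment: {target_label}\nReview:"
    return header + shots + query

def synthesize_dataset(generate, unlabeled_pool, labels, n_per_label):
    """Build a synthetic (text, label) training set purely from LM
    generations; `generate` is a hypothetical prompt -> completion fn."""
    data = []
    for label in labels:
        for _ in range(n_per_label):
            prompt = build_udg_prompt(unlabeled_pool[:3], label)
            data.append((generate(prompt).strip(), label))
    return data
```

The resulting pairs would then be used to train an ordinary task-specific classifier, exactly as if they were human-labeled examples.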