There is rising interest in further exploring the zero-shot learning potential of large pre-trained language models (PLMs). A new paradigm called data-generation-based zero-shot learning has achieved impressive success. In this paradigm, synthesized data from the PLM acts as the carrier of knowledge and is used to train a task-specific model with orders of magnitude fewer parameters than the PLM, achieving both higher performance and greater efficiency than prompt-based zero-shot learning methods on PLMs. The main hurdle of this approach is that the data synthesized by the PLM usually contains a significant portion of low-quality samples. Fitting such data greatly hampers the performance of the task-specific model, making it unreliable for deployment. Previous methods remedy this issue mainly by filtering synthetic data using heuristic metrics (e.g., output confidence) or by refining the data with the help of human experts, which incurs excessive manual tuning or expensive annotation costs. In this paper, we propose SunGen, a novel noise-robust re-weighting framework that automatically constructs high-quality data for zero-shot classification problems. Our framework learns sample weights indicating data quality without requiring any human annotation. We verify, both theoretically and empirically, that our method helps construct good-quality synthetic datasets. Notably, SunGen-LSTM yields a 9.8% relative improvement over the baseline in average accuracy across eight established text classification tasks.
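The core idea of annotation-free sample re-weighting can be illustrated with a minimal sketch. The code below is not the paper's implementation; it is a toy illustration under illustrative assumptions (2D Gaussian blobs standing in for PLM-synthesized data, a 30% label-flip rate standing in for low-quality samples, logistic regression standing in for the small task-specific model, and MAE standing in for a noise-robust loss). Per-sample scores are turned into weights via a softmax; the model is trained on the weighted loss, and scores are updated so that samples with high robust loss, which are likely noisy, receive lower weight, with no human labels involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "synthetic" dataset: two Gaussian blobs with 30% flipped labels,
# mimicking a mix of clean and low-quality generated samples.
# (Dataset, model, and hyperparameters are illustrative assumptions.)
n = 200
X = np.vstack([rng.normal(-2, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y = np.array([0] * n + [1] * n)
noisy_idx = rng.choice(2 * n, size=int(0.3 * 2 * n), replace=False)
y_noisy = y.copy()
y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(s):
    e = np.exp(s - s.max())  # max-subtraction for numerical stability
    return e / e.sum()

s = np.zeros(2 * n)                 # learnable per-sample scores
theta = np.zeros(3)                 # task model params (w1, w2, bias)
Xb = np.hstack([X, np.ones((2 * n, 1))])

for _ in range(300):
    w = softmax(s)                  # sample weights (sum to 1)
    p = sigmoid(Xb @ theta)
    # Inner step: train the small task model on the weighted CE loss.
    theta -= 1.0 * (Xb.T @ (w * (p - y_noisy)))
    # Outer step: update scores with a noise-robust per-sample loss
    # (MAE here); high-loss samples are down-weighted over time.
    robust_loss = np.abs(p - y_noisy)
    s -= 2.0 * (robust_loss - robust_loss.mean())

w = softmax(s)
clean_w = np.delete(w, noisy_idx).mean()
noisy_w = w[noisy_idx].mean()
print(clean_w > noisy_w)
```

In this toy setting the mislabeled samples end up with a lower average weight than the clean ones, which is the behaviour the re-weighting framework relies on; the actual method formulates this as a bilevel optimization rather than the simple alternating updates shown here.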