Despite the success of text-to-text pre-trained models in various natural language generation (NLG) tasks, their generation performance is largely restricted by the amount of labeled data available in downstream tasks, particularly in data-to-text generation. Existing works mostly use abundant unlabeled structured data for unsupervised task-adaptive pre-training, which fails to model the complex relationship between the source structured data and the target texts. We therefore introduce self-training as a better few-shot learner than task-adaptive pre-training: it explicitly captures this relationship via pseudo-labeled data generated by the pre-trained model. To alleviate the side effects of low-quality pseudo-labeled data during self-training, we propose a novel method called Curriculum-Based Self-Training (CBST), which effectively leverages unlabeled data in an order rearranged according to the difficulty of text generation. Experimental results show that our method outperforms fine-tuning and task-adaptive pre-training methods, and achieves state-of-the-art performance in the few-shot setting of data-to-text generation.
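To make the curriculum idea concrete, the following is a minimal sketch of the control flow described above: pseudo-label the unlabeled structured inputs, rank the pseudo pairs by generation difficulty, and fine-tune on progressively harder subsets on top of the few-shot labeled data. The callables `generate`, `difficulty`, and `fine_tune` are hypothetical placeholders for the pre-trained model's inference, difficulty-scoring, and training routines; this is not the paper's exact training recipe.

```python
# Minimal sketch of curriculum-based self-training for data-to-text generation.
# The model interface (generate, difficulty, fine_tune) is a hypothetical placeholder
# for whatever pre-trained text-to-text model is used; only the curriculum's
# control flow is illustrated here.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PseudoPair:
    source: str        # structured input (e.g., a linearized table or triple set)
    target: str        # pseudo text generated by the current model
    difficulty: float  # estimated generation difficulty (lower = easier)


def curriculum_self_training(
    labeled: List[Tuple[str, str]],
    unlabeled: List[str],
    generate: Callable[[str], str],                       # hypothetical: model inference
    difficulty: Callable[[str, str], float],              # hypothetical: difficulty scorer
    fine_tune: Callable[[List[Tuple[str, str]]], None],   # hypothetical: training step
    num_stages: int = 3,
) -> None:
    """Run self-training over `num_stages` curriculum stages, adding pseudo-labeled
    examples from easiest to hardest on top of the few-shot labeled data."""
    # 1. Pseudo-label every unlabeled structured input with the current model.
    pool = [PseudoPair(src, generate(src), 0.0) for src in unlabeled]

    # 2. Score each pseudo pair by generation difficulty and sort ascending.
    for pair in pool:
        pair.difficulty = difficulty(pair.source, pair.target)
    pool.sort(key=lambda p: p.difficulty)

    # 3. Reveal the pool stage by stage: easy examples first, harder ones later.
    for stage in range(1, num_stages + 1):
        cutoff = int(len(pool) * stage / num_stages)
        stage_data = labeled + [(p.source, p.target) for p in pool[:cutoff]]
        fine_tune(stage_data)
```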