Self-training (ST) has prospered again in language understanding by augmenting the fine-tuning of pre-trained language models when labeled data is insufficient. However, it remains challenging to incorporate ST into attribute-controllable language generation. Augmented by only self-generated pseudo text, generation models over-emphasize exploitation of the previously learned space and suffer from a constrained generalization boundary. We revisit ST and propose a novel method, DuNST, to alleviate this problem. DuNST jointly models text generation and classification with a shared Variational AutoEncoder and corrupts the generated pseudo text with two kinds of flexible noise to disturb the space. In this way, our model can construct and utilize both pseudo text generated from given labels and pseudo labels predicted for available unlabeled text, which are gradually refined during the ST process. We theoretically show that DuNST can be regarded as enhancing exploration of the potential real text space, providing a guarantee of improved performance. Experiments on three controllable generation tasks show that DuNST significantly boosts control accuracy while maintaining fluency and diversity comparable to several strong baselines.
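To make the described training scheme concrete, the following is a minimal, self-contained sketch of a dual noisy self-training loop in the spirit of the abstract: a shared model generates noise-corrupted pseudo text from given labels and pseudo labels for unlabeled text, then is retrained on both. All class and function names (SharedModel, add_noise, self_train) are illustrative placeholders invented for this sketch, not the authors' implementation, and the token-drop/local-shuffle noise is only a stand-in for the paper's two kinds of flexible noise.

```python
# Minimal sketch of a dual noisy self-training loop (illustrative only).
import random

random.seed(0)


def add_noise(tokens, drop_p=0.1, shuffle_window=3):
    """Corrupt pseudo text: randomly drop tokens and locally shuffle the rest.
    (A simple stand-in for the flexible noise described in the abstract.)"""
    kept = [t for t in tokens if random.random() > drop_p]
    noisy = kept[:]
    for i in range(len(noisy)):
        j = min(len(noisy) - 1, i + random.randint(0, shuffle_window - 1))
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


class SharedModel:
    """Placeholder for a shared generator/classifier (e.g., a VAE-based LM)."""

    def generate(self, label):
        # Pretend to generate attribute-conditioned text for the given label.
        return [f"{label}_tok{i}" for i in range(8)]

    def classify(self, tokens):
        # Pretend to predict an attribute label for unlabeled text.
        return "positive" if len(tokens) % 2 == 0 else "negative"

    def fine_tune(self, pairs):
        # Placeholder for joint generation + classification updates.
        print(f"fine-tuning on {len(pairs)} (text, label) pairs")


def self_train(model, labeled, unlabeled, labels, rounds=2):
    for _ in range(rounds):
        pseudo = []
        # (1) Pseudo text generated from given labels, corrupted by noise.
        for y in labels:
            pseudo.append((add_noise(model.generate(y)), y))
        # (2) Pseudo labels predicted for available unlabeled text.
        for x in unlabeled:
            pseudo.append((x, model.classify(x)))
        # Retrain on gold data plus the gradually refined pseudo data.
        model.fine_tune(labeled + pseudo)


if __name__ == "__main__":
    model = SharedModel()
    labeled = [(["great", "movie"], "positive")]
    unlabeled = [["boring", "plot"], ["loved", "it", "a", "lot"]]
    self_train(model, labeled, unlabeled, labels=["positive", "negative"])
```

In a real setting, SharedModel would be a pre-trained language model with a shared latent space for generation and classification; the loop structure, however, follows the dual use of pseudo text and pseudo labels described above.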