Pre-trained language models such as BERT have provided significant gains across different NLP tasks. In this paper, we study different types of transformer-based pre-trained models for conditional data augmentation: auto-regressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART). We show that prepending the class label to text sequences provides a simple yet effective way to condition the pre-trained models for data augmentation. Additionally, on three classification benchmarks, the pre-trained seq2seq model outperforms other data augmentation methods in a low-resource setting. Further, we explore how data augmentation based on different pre-trained models differs in terms of data diversity, and how well such methods preserve the class-label information.
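To illustrate the label-prepending idea, here is a minimal sketch using the Hugging Face transformers GPT-2 API. It is not the paper's exact implementation: it omits the fine-tuning step on label-prepended training data, and the separator token, prompt truncation, and sampling settings are illustrative assumptions.

```python
# Minimal sketch of label-prepended conditional data augmentation with GPT-2.
# Assumptions: no fine-tuning shown; separator, prompt length, and sampling
# parameters are illustrative, not the paper's reported configuration.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def augment(label: str, text: str, num_samples: int = 3):
    """Generate augmented examples conditioned on the class label by
    prepending the label to the text sequence."""
    # Condition the model by prefixing the class label, then seed generation
    # with the first half of the original example.
    prompt = f"{label} {tokenizer.eos_token} {text[: len(text) // 2]}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,  # sampling yields more diverse augmentations
        top_k=50,
        max_length=inputs["input_ids"].shape[1] + 40,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Example: generate label-conditioned variants of a positive review.
print(augment("positive", "The movie was wonderfully acted and beautifully shot."))
```

In practice, the model would first be fine-tuned on training sequences formatted the same way (label, separator, text), so that generation conditioned on a label tends to preserve that label's class information.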