This paper focuses on data augmentation for low-resource NLP tasks, where the training set is limited. Existing solutions either apply task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT-2) on the limited training instances to produce new synthetic data. Consequently, they carry little task-specific knowledge and tend to yield low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pre-trained on a mixture of diverse NLP tasks under a novel framework called Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse task-specific NLP knowledge into the single KnowDA model (i.e., all-in-one), so that KnowDA can use this knowledge to quickly grasp the inherent synthesis law of the target task from limited training instances. Specifically, KoMT reformulates input examples from heterogeneous NLP tasks into a unified text-to-text format and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, this is the first attempt to apply multi-task training over 100+ NLP tasks to data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA improves the performance of strong pre-trained language models (i.e., BERT, ALBERT, and DeBERTa) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL'03, and WikiAnn; and ii) KnowDA transfers task knowledge to NLP tasks whose types are both seen and unseen in KoMT.
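The reformulation and denoising steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bracketed key-value layout, the function names `to_key_value_text`, `mask_field`, and `mask_tokens`, the `<extra_id_0>` placeholder, and the 30% span ratio are all assumptions chosen for illustration. The sketch only shows how one example can yield a coarse-grained target (a whole field) and a fine-grained target (a short token span).

```python
import random

SENTINEL = "<extra_id_0>"  # hypothetical mask placeholder, not taken from the paper


def to_key_value_text(example):
    """Flatten a task example (field name -> string) into one text sequence."""
    return " ".join(f"[{key}] {value}" for key, value in example.items())


def mask_field(example, field):
    """Coarse-grained denoising: hide one whole field; the target is its value."""
    corrupted = dict(example)
    target = corrupted[field]
    corrupted[field] = SENTINEL
    return to_key_value_text(corrupted), f"{SENTINEL} {target}"


def mask_tokens(example, field, ratio=0.3, seed=0):
    """Fine-grained denoising: hide one contiguous token span inside a field."""
    tokens = example[field].split()
    if not tokens:
        raise ValueError(f"field {field!r} is empty")
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * ratio))
    start = rng.randrange(0, len(tokens) - span_len + 1)
    target = " ".join(tokens[start:start + span_len])
    corrupted = dict(example)
    corrupted[field] = " ".join(
        tokens[:start] + [SENTINEL] + tokens[start + span_len:]
    )
    return to_key_value_text(corrupted), f"{SENTINEL} {target}"


if __name__ == "__main__":
    # An invented NLI-style example; any task with named fields fits the same shape.
    nli = {
        "premise": "A man is playing a guitar on stage",
        "hypothesis": "A person performs music",
        "label": "entailment",
    }
    print(mask_field(nli, "label"))
    print(mask_tokens(nli, "premise"))
```

Because every task is reduced to the same flat string, a single Seq2Seq model can be trained on all of them with one loss; at augmentation time, masking a field of a seed example and letting the model fill it in would produce a new synthetic instance.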