This paper focuses on text data augmentation for few-shot NLP tasks. Existing data augmentation algorithms either leverage task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT2) on a small training set to produce new synthetic data. Consequently, these methods carry little task-specific knowledge and can only yield low-quality synthetic data that benefits weak baselines on simple tasks. To address this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pre-trained on a mixture of diverse NLP tasks via Knowledge Mixture Training (KoMT). KoMT is a training procedure that reformulates input examples from heterogeneous NLP tasks into a unified text-to-text format and employs denoising objectives of different granularities to learn to generate partial or complete examples. With the aid of KoMT, KnowDA can implicitly combine the required task-specific knowledge from the learned mixture of tasks and quickly grasp the inherent synthesis pattern of the target task from a few given instances. To the best of our knowledge, ours is the first attempt to scale the number of tasks to 100+ in multi-task co-training for data augmentation. Extensive experiments show that i) KnowDA improves the performance of ALBERT and DeBERTa by a large margin on the FewGLUE benchmark, outperforming previous state-of-the-art data augmentation baselines; and ii) KnowDA also improves model performance on few-shot NER tasks, a held-out task type not included in KoMT.
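To make the KoMT idea concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of how a labeled example might be serialized into a unified text-to-text format and corrupted with sentinel tokens for a denoising objective; the function names, the "key: value" serialization, and the masking scheme are our own assumptions for exposition.

```python
import random

# Hypothetical KoMT-style preprocessing sketch. `format_example`, the
# " | "-separated key-value serialization, and the sentinel masking below are
# illustrative assumptions, not the published KnowDA code.

def format_example(task_name, fields):
    """Serialize one labeled example into a unified key-value text string."""
    parts = [f"task: {task_name}"]
    parts += [f"{key}: {value}" for key, value in fields.items()]
    return " | ".join(parts)

def make_denoising_pair(text, mask_prob=0.3, seed=0):
    """Mask whole key-value segments with sentinel tokens so the decoder
    learns to reconstruct partial or complete examples."""
    rng = random.Random(seed)
    source, target, sentinel_id = [], [], 0
    for segment in text.split(" | "):
        if rng.random() < mask_prob:
            sentinel = f"<extra_id_{sentinel_id}>"
            source.append(sentinel)
            target.append(f"{sentinel} {segment}")
            sentinel_id += 1
        else:
            source.append(segment)
    return " | ".join(source), " ".join(target) or "<no_mask>"

# Usage example on a toy NLI instance.
example = {"premise": "A man is playing guitar.",
           "hypothesis": "A man is making music.",
           "label": "entailment"}
src, tgt = make_denoising_pair(format_example("nli", example))
print("encoder input :", src)
print("decoder target:", tgt)
```

Under this sketch, varying how many segments are masked per example corresponds to denoising at different granularities: masking a single field teaches partial completion, while masking most fields pushes the model toward generating a complete synthetic example.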