Most previous methods for text data augmentation are limited to simple tasks and weak baselines. We explore data augmentation on hard tasks (i.e., few-shot natural language understanding) and strong baselines (i.e., pretrained models with over one billion parameters). Under this setting, we reproduced a large number of previous augmentation methods and found that these methods bring marginal gains at best and sometimes substantially degrade performance. To address this challenge, we propose a novel data augmentation method, FlipDA, that jointly uses a generative model and a classifier to generate label-flipped data. Central to the idea of FlipDA is the discovery that generating label-flipped data is more crucial to performance than generating label-preserving data. Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness: it substantially improves many tasks while not negatively affecting the others.
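To make the recipe concrete, here is a minimal Python sketch of the two-stage FlipDA idea as summarized above: corrupt an example, let a pretrained generative model fill it in, then keep only candidates whose classifier-predicted label flips with high confidence. This is an illustrative sketch, not the authors' implementation; it assumes a Hugging Face T5 model as the generator and an sklearn-style `predict_proba` interface on the classifier, and the masking ratio and confidence threshold are made-up defaults.

```python
# Minimal sketch of the FlipDA recipe (not the authors' exact code):
# (1) mask part of an example and fill it in with a pretrained seq2seq
#     model (T5 span infilling via Hugging Face transformers),
# (2) keep only candidates whose classifier-predicted label FLIPS from
#     the original label with high confidence, and relabel them.
import random
import re
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
generator = T5ForConditionalGeneration.from_pretrained("t5-base")


def mask_words(text: str, mask_ratio: float = 0.3) -> str:
    """Replace a random subset of words with T5 sentinel tokens."""
    words, k = text.split(), 0
    for i in range(len(words)):
        if random.random() < mask_ratio and k < 100:  # T5 has 100 sentinels
            words[i] = f"<extra_id_{k}>"
            k += 1
    return " ".join(words)


def fill_in(masked: str, num_candidates: int = 8) -> list[str]:
    """Sample T5 completions and splice them back into the masked text."""
    inputs = tokenizer(masked, return_tensors="pt")
    outputs = generator.generate(
        **inputs, do_sample=True, top_p=0.9,
        num_return_sequences=num_candidates, max_new_tokens=64,
    )
    candidates = []
    for ids in outputs:
        decoded = tokenizer.decode(ids, skip_special_tokens=False)
        decoded = decoded.replace("<pad>", "").replace("</s>", "")
        # T5 emits "<extra_id_0> span <extra_id_1> span ..."; map each
        # sentinel back to its generated span and substitute it in.
        parts = re.split(r"(<extra_id_\d+>)", decoded)
        text = masked
        for i in range(1, len(parts) - 1, 2):
            text = text.replace(parts[i], parts[i + 1].strip(), 1)
        candidates.append(text)
    return candidates


def flipda_select(candidates, orig_label, classifier, threshold=0.9):
    """Core FlipDA step: keep candidates whose predicted label differs
    from the original and carries high confidence; relabel them."""
    kept = []
    for text, probs in zip(candidates, classifier.predict_proba(candidates)):
        new_label = int(probs.argmax())
        if new_label != orig_label and probs[new_label] >= threshold:
            kept.append((text, new_label))
    return kept
```

In a few-shot setting like the paper's, the selection classifier can simply be the baseline model trained on the original labeled examples, so no extra supervision is needed; the selected label-flipped candidates are then added to the training set alongside the original data.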