Data augmentation is an effective way to combat overfitting. Prior work has proposed many data augmentation strategies for NLP, such as noise injection, word replacement, and back-translation. Though effective, these strategies overlook one important characteristic of language, compositionality: the meaning of a complex expression is built from the meanings of its sub-parts. Motivated by this, we propose TreeMix, a compositional data augmentation approach for natural language understanding. Specifically, TreeMix uses constituency parse trees to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them into new sentences. Compared with previous approaches, TreeMix introduces greater diversity into the generated samples and encourages models to learn the compositionality of NLP data. Extensive experiments on text classification and SCAN demonstrate that TreeMix outperforms current state-of-the-art data augmentation methods.
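The decompose-and-recombine idea can be illustrated with a minimal sketch. It is not the authors' implementation: the parse trees are hand-written nested tuples rather than parser output, the helper names (`leaves`, `subtree_spans`, `tree_mix`) are hypothetical, and the label-mixing weight is assumed to be proportional to the number of tokens each source sentence contributes, a detail the abstract does not specify.

```python
import random

# Toy constituency tree: (label, child, child, ...); a leaf is a plain token string.
# These trees are hand-written for illustration; a real pipeline would obtain
# them from a constituency parser.

def leaves(tree):
    """Return the tokens of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    tokens = []
    for child in tree[1:]:
        tokens.extend(leaves(child))
    return tokens

def subtree_spans(tree, start=0):
    """Return ([(label, start, end), ...], next_position) for all internal nodes."""
    if isinstance(tree, str):
        return [], start + 1
    spans, pos = [], start
    for child in tree[1:]:
        child_spans, pos = subtree_spans(child, pos)
        spans.extend(child_spans)
    spans.append((tree[0], start, pos))
    return spans, pos

def tree_mix(tree_a, label_a, tree_b, label_b, num_classes, rng):
    """Replace one constituent of sentence A with one constituent of sentence B."""
    tokens_a, tokens_b = leaves(tree_a), leaves(tree_b)
    # Exclude spans covering a whole sentence so part of A always survives.
    cand_a = [s for s in subtree_spans(tree_a)[0] if s[2] - s[1] < len(tokens_a)]
    cand_b = [s for s in subtree_spans(tree_b)[0] if s[2] - s[1] < len(tokens_b)]
    _, a_start, a_end = rng.choice(cand_a)
    _, b_start, b_end = rng.choice(cand_b)
    new_tokens = tokens_a[:a_start] + tokens_b[b_start:b_end] + tokens_a[a_end:]
    # Assumption: Mixup-style soft label weighted by token counts from each source.
    kept = len(tokens_a) - (a_end - a_start)
    borrowed = b_end - b_start
    lam = kept / (kept + borrowed)
    soft_label = [0.0] * num_classes
    soft_label[label_a] += lam
    soft_label[label_b] += 1.0 - lam
    return new_tokens, soft_label

tree_a = ("S",
          ("NP", ("DT", "the"), ("NN", "movie")),
          ("VP", ("VBD", "was"), ("ADJP", ("JJ", "great"))))
tree_b = ("S",
          ("NP", ("DT", "the"), ("NN", "plot")),
          ("VP", ("VBD", "felt"), ("ADJP", ("RB", "painfully"), ("JJ", "slow"))))

rng = random.Random(0)
mixed_tokens, mixed_label = tree_mix(tree_a, 1, tree_b, 0, num_classes=2, rng=rng)
print(" ".join(mixed_tokens), mixed_label)
```

Swapping whole constituents rather than arbitrary token windows keeps each exchanged fragment a syntactic unit, which is the compositional aspect the abstract emphasizes; the soft label then reflects how much of each source sentence remains in the new sample.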