To investigate the role of linguistic knowledge in data augmentation (DA) for Natural Language Processing (NLP), we designed two adapted DA programs and applied them to LCQMC (a Large-scale Chinese Question Matching Corpus) for a binary Chinese question matching classification task. The two DA programs produce augmented texts via five simple text editing operations (or DA techniques), largely irrespective of language generation rules, but one of them is enhanced with a pre-trained n-gram language model to fuse it with prior linguistic knowledge. We then trained four neural network models (BOW, CNN, LSTM, and GRU) and a pre-trained model (ERNIE-Gram) on the LCQMC train sets of varying sizes as well as the corresponding augmented train sets produced by the two DA programs. The results show that there are no significant performance differences between the models trained on the two types of augmented train sets, whether the five DA techniques are applied together or separately. Moreover, because the five DA techniques cannot produce strictly paraphrastic augmented texts, the results indicate that the classification models trained on them need a sufficient number of training examples to mitigate the negative impact of falsely matched augmented text pairs and improve performance, a limitation of random text editing perturbations used as a DA approach. Similar results were also obtained for English.
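To make the contrast between the two DA programs concrete, the following is a minimal sketch, not the authors' actual implementation, of EDA-style random text-editing augmentation with an optional n-gram language-model filter. The abstract does not enumerate the five operations, so the sketch uses four common ones (synonym replacement, random swap, random deletion, random insertion); the toy synonym table, the `BigramLM` class, and the candidate-scoring scheme are assumptions made purely for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical synonym table; a real system would use a thesaurus such as
# WordNet (English) or a Chinese synonym lexicon.
SYNONYMS = {"quick": ["fast", "rapid"], "question": ["query"]}

def synonym_replacement(tokens, n=1):
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, n=1):
    out = tokens[:]
    for _ in range(n):
        if len(out) >= 2:
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

def random_insertion(tokens, n=1):
    out = tokens[:]
    for _ in range(n):
        out.insert(random.randrange(len(out) + 1), random.choice(tokens))
    return out

class BigramLM:
    """Toy add-one-smoothed bigram model, standing in for the pre-trained
    n-gram LM that injects prior linguistic knowledge."""
    def __init__(self, corpus):
        self.uni, self.bi = defaultdict(int), defaultdict(int)
        for sent in corpus:
            toks = ["<s>"] + sent
            for a, b in zip(toks, toks[1:]):
                self.uni[a] += 1
                self.bi[(a, b)] += 1
        self.vocab = len(self.uni) + 1

    def logprob(self, tokens):
        toks = ["<s>"] + tokens
        return sum(math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.vocab))
                   for a, b in zip(toks, toks[1:]))

def augment(tokens, lm=None, tries=5):
    """Apply one random editing operation; if an LM is given, generate several
    candidates and keep the most fluent one (the knowledge-enhanced variant)."""
    ops = [synonym_replacement, random_swap, random_deletion, random_insertion]
    candidates = [random.choice(ops)(tokens) for _ in range(tries)]
    if lm is None:                              # knowledge-free variant
        return random.choice(candidates)
    return max(candidates, key=lm.logprob)      # LM-guided variant
```

In this sketch the knowledge-free variant simply returns a random perturbation, while the LM-guided variant re-ranks candidate perturbations by fluency; neither guarantees that the augmented text is a strict paraphrase of the original, which is the limitation the abstract points to.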