Deep learning-based text classification models need abundant labeled data to achieve competitive performance. Unfortunately, annotating a large corpus is time-consuming and laborious. To tackle this, many studies use data augmentation to expand the corpus. However, data augmentation may produce noisy augmented samples, and to date no work has explored sample selection for augmented samples in the natural language processing field. In this paper, we propose a novel self-training selection framework with two selectors that pick high-quality samples from data augmentation. Specifically, we first use an entropy-based strategy and the model's predictions to select augmented samples. Since some high-quality samples may be wrongly filtered out in this step, we propose to recall them from two perspectives: word overlap and semantic similarity. Experimental results demonstrate the effectiveness and simplicity of our framework.
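To make the two-stage selection concrete, here is a minimal Python sketch. It assumes a classifier exposing class probabilities and a sentence encoder, both passed in as callables; the function and parameter names (`select_augmented`, `predict_proba`, `encode`, the thresholds) are hypothetical illustrations, not the paper's actual interface or reported settings.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def jaccard_overlap(a, b):
    """Word-overlap score between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def select_augmented(triples, predict_proba, encode,
                     entropy_threshold=0.5,
                     overlap_threshold=0.6,
                     similarity_threshold=0.8):
    """Two-stage selection over (original, augmented, label) triples.

    Stage 1 keeps augmented samples whose predicted distribution is
    low-entropy AND whose predicted label matches the original label.
    Stage 2 recalls filtered samples that stay close to their source
    sentence in word overlap or embedding similarity.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    selected, filtered = [], []
    for orig, aug, label in triples:
        probs = predict_proba(aug)
        pred = max(range(len(probs)), key=probs.__getitem__)
        if entropy(probs) <= entropy_threshold and pred == label:
            selected.append((aug, label))
        else:
            filtered.append((orig, aug, label))
    # Recall stage: rescue samples that remain close to their source.
    for orig, aug, label in filtered:
        if (jaccard_overlap(orig, aug) >= overlap_threshold
                or cosine(encode(orig), encode(aug)) >= similarity_threshold):
            selected.append((aug, label))
    return selected
```

In practice, `predict_proba` would wrap the task classifier trained on the clean labeled data, and `encode` would be a pretrained sentence encoder; keeping the recall stage separate means conservative entropy filtering can be used without permanently discarding near-duplicates of trusted source sentences.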