Data Augmentation (DA) is frequently used to automatically provide additional training data without extra human annotation. However, data augmentation may introduce noisy examples that impair training. To guarantee the quality of augmented data, existing methods either assume the augmented data are noise-free and adopt consistency training, or rely on simple heuristics such as training loss and diversity constraints to filter out ``noisy'' data. However, the filtered examples may still contain useful information, and dropping them entirely discards supervision signals. In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. A simple self-regularization module further prevents overfitting to noisy labels by forcing the model predictions to be consistent across two distinct dropout passes. Our method is applicable to general augmentation techniques and consistently improves performance on both text classification and question-answering tasks.
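The abstract describes two components: (i) distillation from soft labels produced by an organic teacher (a model trained on the cleaner original data) for the augmented examples, and (ii) a self-regularization term that penalizes disagreement between two dropout-perturbed forward passes. The PyTorch sketch below is a minimal illustration of such a combined objective, not the paper's exact formulation; the function name `denoised_augmentation_loss` and the weights `alpha` and `beta` are hypothetical.

```python
import torch.nn.functional as F

def denoised_augmentation_loss(model, x_orig, y_orig, x_aug, teacher_logits,
                               alpha=0.5, beta=1.0):
    """Sketch: clean-data cross-entropy + soft-label distillation on
    augmented data + dropout-based self-regularization.
    `alpha` and `beta` are assumed weighting hyperparameters."""
    # Standard cross-entropy on the (cleaner) original data.
    loss_orig = F.cross_entropy(model(x_orig), y_orig)

    # Distillation: the student matches the soft labels produced by the
    # organic teacher for the augmented examples.
    logits_a = model(x_aug)  # first dropout-perturbed forward pass
    loss_soft = F.kl_div(F.log_softmax(logits_a, dim=-1),
                         F.softmax(teacher_logits.detach(), dim=-1),
                         reduction="batchmean")

    # Self-regularization: a second forward pass in train mode samples a
    # different dropout mask; a symmetric KL penalizes inconsistency.
    logits_b = model(x_aug)
    loss_reg = 0.5 * (
        F.kl_div(F.log_softmax(logits_a, dim=-1),
                 F.softmax(logits_b, dim=-1), reduction="batchmean")
        + F.kl_div(F.log_softmax(logits_b, dim=-1),
                   F.softmax(logits_a, dim=-1), reduction="batchmean")
    )

    return loss_orig + alpha * loss_soft + beta * loss_reg
```

In this sketch the first augmented forward pass serves both the distillation and the consistency terms, so only two passes over the augmented batch are needed; the paper may weight or schedule these terms differently.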