Large NLP datasets often contain noisy labels introduced by errors in automatic and human annotation. We study text classification under label noise and aim to capture that noise with an auxiliary noise model placed over the classifier. We first assign each training sample a probability of having a noisy label, using a beta mixture model fitted to the per-sample losses at an early epoch of training. We then use this score to selectively guide the learning of both the noise model and the classifier. Our empirical evaluation on two text classification tasks shows that the approach can improve over the baseline accuracy and prevent over-fitting to the noise.
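
The scoring step can be illustrated with a minimal sketch. It assumes per-sample losses are min-max normalized to (0, 1), that a two-component beta mixture is fitted by EM with a weighted method-of-moments update, and that the component with the higher mean loss corresponds to noisy labels. None of these details are stated in the abstract, and the function names `beta_pdf` and `noisy_label_probability` are hypothetical.

```python
import math
import numpy as np


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1.0) * np.log(x) + (b - 1.0) * np.log1p(-x))


def noisy_label_probability(losses, n_iter=50, eps=1e-4):
    """Posterior probability that each sample belongs to the high-loss component."""
    losses = np.asarray(losses, dtype=float)
    # Assumption: rescale losses to (0, 1), the support of the beta distribution.
    span = losses.max() - losses.min()
    x = np.clip((losses - losses.min()) / (span + 1e-12), eps, 1.0 - eps)

    # Soft initial responsibilities: split at the midpoint of the normalized range.
    hard = (x > 0.5).astype(float)
    resp = np.stack([1.0 - hard, hard], axis=1) * 0.9 + 0.05

    for _ in range(n_iter):
        # M-step: mixing weights, then beta parameters from weighted moments.
        mix = resp.sum(axis=0) / len(x)
        params = []
        for k in range(2):
            w = resp[:, k] / (resp[:, k].sum() + 1e-12)
            mean = np.sum(w * x)
            var = np.sum(w * (x - mean) ** 2) + 1e-8
            common = mean * (1.0 - mean) / var - 1.0
            a = max(mean * common, 1e-2)
            b = max((1.0 - mean) * common, 1e-2)
            params.append((a, b))
        # E-step: recompute responsibilities under the updated components.
        lik = np.stack(
            [mix[k] * beta_pdf(x, *params[k]) for k in range(2)], axis=1
        )
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)

    # Assumption: the component with the larger mean models the noisy samples.
    means = [a / (a + b) for a, b in params]
    return resp[:, int(np.argmax(means))]


# Synthetic illustration only: 800 low-loss and 200 high-loss samples.
rng = np.random.default_rng(0)
demo_losses = np.concatenate(
    [rng.gamma(4.0, 0.075, size=800), rng.normal(2.5, 0.4, size=200)]
)
noise_prob = noisy_label_probability(demo_losses)
```

In a training loop, `losses` would be the per-sample classification losses collected at the chosen early epoch, and the returned probabilities could then weight or gate how each sample contributes to the noise model and the classifier. How the score is actually applied is not specified here, so that part is left open.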