Label noise is ubiquitous in many machine learning scenarios, such as self-labeling with model predictions and erroneous data annotation. Many existing approaches rely on heuristics such as sample losses, which may not be flexible enough to yield optimal solutions. Meta-learning-based methods address this issue by learning a data selection function, but they can be hard to optimize. In light of these pros and cons, we propose Selection-Enhanced Noisy label Training (SENT), which does not rely on meta learning while retaining the flexibility of being data-driven. SENT transfers the noise distribution to a clean set and trains a model to distinguish noisy labels from clean ones using model-based features. Empirically, on a wide range of tasks including text classification and speech recognition, SENT improves performance over strong baselines under both self-training and label-corruption settings.
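To make the core idea concrete, the following is a minimal sketch of the noise-transfer step described above, not the authors' implementation: a clean subset is corrupted with an estimated noise distribution, and a binary selector is trained on model-based features (here, per-example loss and the probability assigned to the observed label; all function and variable names are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of the SENT idea, not the paper's code:
# corrupt a clean subset with an estimated noise distribution, then
# train a selector to separate noisy labels from clean ones using
# model-based features.

def corrupt_labels(labels, noise_matrix, rng):
    """Resample each label from the corresponding row of the noise matrix
    (rows are assumed to sum to 1)."""
    num_classes = noise_matrix.shape[0]
    return np.array([rng.choice(num_classes, p=noise_matrix[y]) for y in labels])

def model_features(probs, labels):
    """Features from a task model's predictions: per-example cross-entropy
    loss and the probability assigned to the observed label."""
    p_label = probs[np.arange(len(labels)), labels]
    loss = -np.log(np.clip(p_label, 1e-12, None))
    return np.stack([loss, p_label], axis=1)

def train_selector(probs, clean_labels, noise_matrix, seed=0):
    """Train a binary classifier that predicts whether a label is clean."""
    rng = np.random.default_rng(seed)
    noisy_labels = corrupt_labels(clean_labels, noise_matrix, rng)
    X = model_features(probs, noisy_labels)
    y = (noisy_labels == clean_labels).astype(int)  # 1 = label survived corruption
    return LogisticRegression().fit(X, y)
```

At inference time, such a selector would score the noisy training set via the same `model_features`, and examples predicted as clean would be kept for training.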