Weakly supervised text classification methods typically train a deep neural classifier based on pseudo-labels. The quality of pseudo-labels is crucial to final performance, but they are inevitably noisy due to their heuristic nature, so selecting the correct ones offers large potential for performance gains. One straightforward solution is to select samples based on the neural classifier's softmax probability scores for their pseudo-labels. However, we show through our experiments that such solutions are ineffective and unstable due to erroneously high-confidence predictions from poorly calibrated models. Recent studies on the memorization effects of deep neural models suggest that these models first memorize training samples with clean labels and then those with noisy labels. Inspired by this observation, we propose a novel pseudo-label selection method, LOPS, that takes the learning order of samples into consideration. We hypothesize that a sample's learning order reflects, in terms of ranking, the probability that its pseudo-label is wrong, and therefore propose to select the samples that are learned earlier. LOPS can be viewed as a strong performance-boosting plug-in for most existing weakly supervised text classification methods, as confirmed by extensive experiments on four real-world datasets.
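To make the idea of learning-order-based selection concrete, here is a minimal sketch (not the authors' official implementation; the function names and the keep ratio are illustrative assumptions). It records, for each pseudo-labeled sample, the first epoch at which the model's prediction agrees with its pseudo-label, then keeps the samples that were learned earliest, in contrast to confidence-based filtering on softmax scores.

```python
# Hypothetical sketch of learning-order-based pseudo-label selection.
# Function names and the keep_ratio parameter are illustrative, not the
# authors' official API.
import numpy as np

def record_learning_order(pred_history, pseudo_labels):
    """Return, for each sample, the first epoch at which the model's
    prediction matches its pseudo-label (np.inf if it never matches).

    pred_history: (num_epochs, num_samples) array of predicted class ids
    pseudo_labels: (num_samples,) array of pseudo-label class ids
    """
    num_epochs, num_samples = pred_history.shape
    first_learned = np.full(num_samples, np.inf)
    for epoch in range(num_epochs):
        newly_learned = (pred_history[epoch] == pseudo_labels) & np.isinf(first_learned)
        first_learned[newly_learned] = epoch
    return first_learned

def select_by_learning_order(first_learned, keep_ratio=0.5):
    """Keep the earliest-learned fraction of samples (ties broken arbitrarily)."""
    num_keep = int(len(first_learned) * keep_ratio)
    order = np.argsort(first_learned)  # earlier-learned samples come first
    return order[:num_keep]

# Toy usage: predictions over 3 epochs for 5 pseudo-labeled samples.
pred_history = np.array([[0, 1, 2, 0, 1],
                         [0, 1, 1, 0, 2],
                         [0, 1, 1, 0, 0]])
pseudo_labels = np.array([0, 1, 1, 2, 0])
first_learned = record_learning_order(pred_history, pseudo_labels)
print(select_by_learning_order(first_learned, keep_ratio=0.6))  # -> [0 1 2]
```

In this sketch, the sample whose prediction never matches its pseudo-label (likely a wrong annotation) is ranked last and dropped, which is the intended effect of selecting earlier-learned samples.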