Training deep neural networks (DNNs) with limited supervision has been a popular research topic, as it can significantly alleviate the annotation burden. Self-training has been successfully applied to semi-supervised learning tasks, but one drawback is its vulnerability to label noise from incorrect pseudo labels. Inspired by the fact that samples with similar labels tend to share similar representations, we develop a neighborhood-based sample selection approach to tackle the issue of noisy pseudo labels. We further stabilize self-training by aggregating the predictions from different rounds during sample selection. Experiments on eight tasks show that our proposed method outperforms the strongest self-training baseline by 1.83% and 2.51% on average on text and graph datasets, respectively. Our further analysis demonstrates that our proposed data selection strategy reduces the noise of pseudo labels by 36.8% and saves 57.3% of the runtime compared with the best baseline. Our code and appendices will be uploaded to https://github.com/ritaranx/NeST.
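To make the two ideas in the abstract concrete, below is a minimal sketch of neighborhood-based selection combined with cross-round prediction aggregation. The function name, the k-nearest-neighbor agreement rule, the simple mean over rounds, and all thresholds are illustrative assumptions for exposition, not the exact NeST procedure described in the paper.

```python
import numpy as np

def select_pseudo_labels(emb_unlab, probs_by_round, emb_lab, y_lab,
                         k=5, agree_thresh=0.6):
    """Keep pseudo-labeled samples whose embedding-space neighbors
    agree with their pseudo label.

    emb_unlab:      (n_unlab, d) embeddings of unlabeled samples.
    probs_by_round: list of (n_unlab, n_classes) prediction matrices,
                    one per self-training round; averaging them is one
                    simple way to aggregate predictions across rounds.
    emb_lab, y_lab: embeddings and gold labels of labeled samples.
    """
    # Aggregate predictions from different rounds (mean as a stand-in
    # for the paper's aggregation) and derive pseudo labels.
    probs = np.mean(probs_by_round, axis=0)
    pseudo = probs.argmax(axis=1)

    keep = []
    for i, z in enumerate(emb_unlab):
        # Find the k nearest labeled samples by Euclidean distance.
        dists = np.linalg.norm(emb_lab - z, axis=1)
        nn = np.argsort(dists)[:k]
        # Keep the sample only if enough neighbors share its pseudo
        # label, filtering out likely-noisy pseudo labels.
        if np.mean(y_lab[nn] == pseudo[i]) >= agree_thresh:
            keep.append(i)
    return np.array(keep, dtype=int), pseudo
```

Under these assumptions, the selected indices and their pseudo labels would then be added to the training set for the next self-training round.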