While semi-supervised learning (SSL) has received tremendous attention in many machine learning tasks due to its successful use of unlabeled data, existing SSL algorithms use either all unlabeled examples or only those unlabeled examples whose predictions exceed a fixed high-confidence threshold during the training process. As a result, too many correct pseudo-labeled examples may be eliminated, or too many wrong ones may be selected. In this work we develop a simple yet powerful framework whose key idea is to select a subset of training examples from the unlabeled data when performing existing SSL methods, so that only the unlabeled examples with pseudo labels related to the labeled data are used to train models. The selection is performed at each update iteration by keeping only the examples whose losses are smaller than a given threshold that is dynamically adjusted through the iterations. Our proposed approach, Dash, enjoys adaptivity in its selection of unlabeled data and comes with a theoretical guarantee. Specifically, we establish the convergence rate of Dash from the viewpoint of non-convex optimization. Finally, we empirically demonstrate the effectiveness of the proposed method in comparison with state-of-the-art methods on benchmarks.
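The selection rule described above can be sketched as follows. This is a minimal illustration of the dynamic-threshold idea, not the paper's exact algorithm: the function name `dash_select` and the hyperparameters `C`, `gamma`, and `rho_hat` are hypothetical stand-ins for whatever schedule and scale the method actually uses.

```python
import numpy as np

def dash_select(losses, t, rho_hat, C=1.25, gamma=1.1):
    """Keep unlabeled examples whose pseudo-label loss is below a
    threshold that shrinks as the update iteration t grows.

    losses:  per-example losses on pseudo-labeled data (1-D array)
    t:       current update iteration (1-indexed)
    rho_hat: a scale anchor for the threshold, e.g. an estimate of
             the loss on labeled data (hypothetical choice)
    C, gamma: illustrative constants controlling the initial slack
              and the decay rate of the threshold
    """
    # One plausible schedule: start at C * rho_hat and decay
    # geometrically toward rho_hat, so selection tightens over time.
    threshold = max(C * gamma ** (-(t - 1)) * rho_hat, rho_hat)
    return losses < threshold  # boolean mask over the unlabeled batch

# Example: as t grows, fewer (harder) examples pass the threshold.
losses = np.array([0.1, 0.5, 1.0, 2.0])
early = dash_select(losses, t=1, rho_hat=0.5)   # looser threshold
late = dash_select(losses, t=20, rho_hat=0.5)   # tighter threshold
```

Only the examples where the mask is `True` would contribute to the SSL loss at that iteration; the decaying threshold is what lets the model rely on more confident pseudo labels as training progresses.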