Since data is the fuel that drives machine learning models and labeled data is generally expensive to obtain, semi-supervised methods remain consistently popular: they enable building large training sets without requiring many expert labels. This work combines self-labeling techniques with active learning in a selective sampling scenario. We propose a new method that builds an ensemble classifier. Based on a measure of the inconsistency among the decisions of the individual base classifiers for a given observation, the method decides whether to request a new label from the oracle or to apply self-labeling. In preliminary studies, we show that naive application of self-labeling can harm performance by introducing a bias towards selected classes and, consequently, a skewed class distribution. Hence, we also propose mechanisms to mitigate this phenomenon. Experimental evaluation shows that the proposed method matches or outperforms current selective sampling methods.
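The core decision rule described above can be illustrated with a minimal sketch. All names and the disagreement threshold below are illustrative assumptions, not the paper's exact method: given the votes of the base classifiers on one observation, high inconsistency triggers an oracle query, while a consistent ensemble self-labels with the majority class.

```python
# Minimal sketch of disagreement-based selective sampling with self-labeling.
# The threshold value and function names are illustrative assumptions.
from collections import Counter


def disagreement(votes):
    """Fraction of base classifiers that disagree with the majority vote."""
    _, majority_count = Counter(votes).most_common(1)[0]
    return 1.0 - majority_count / len(votes)


def decide(votes, threshold=0.3):
    """Return ('query', None) when the ensemble is too inconsistent and the
    oracle should be asked for a label; otherwise return
    ('self-label', majority_class) and use the ensemble's own prediction."""
    if disagreement(votes) > threshold:
        return "query", None
    majority_class, _ = Counter(votes).most_common(1)[0]
    return "self-label", majority_class
```

For example, a unanimous ensemble self-labels, while a three-way split exceeds the threshold and requests a new label from the oracle.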