Recent progress in semi- and self-supervised learning has challenged the long-held belief that machine learning requires enormous amounts of labeled data and that unlabeled data is irrelevant. Although these methods have succeeded in various domains, no dominant semi- or self-supervised learning method generalizes to tabular data (i.e., most existing methods require domain-specific datasets and architectures). In this paper, we revisit self-training, which can be applied to any kind of algorithm, including the most widely used architecture, the gradient boosting decision tree, and introduce curriculum pseudo-labeling (a state-of-the-art pseudo-labeling technique in the image domain) to the tabular domain. Furthermore, existing pseudo-labeling techniques do not guarantee the cluster assumption when computing confidence scores for pseudo-labels generated from unlabeled data. To overcome this issue, we propose a novel pseudo-labeling approach that regularizes the confidence scores based on the likelihoods of the pseudo-labels, so that more reliable pseudo-labels lying in high-density regions can be obtained. We exhaustively validate the superiority of our approaches on various models and tabular datasets.
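As a rough illustration of the self-training loop with curriculum pseudo-labeling described above, the sketch below uses scikit-learn's gradient boosting classifier and a confidence-quantile schedule that admits a growing share of pseudo-labels each round. This is a minimal generic sketch, not the paper's exact method: the threshold schedule, quantile values, and synthetic data are assumptions for illustration only.

```python
# Hedged sketch of self-training with a curriculum-style confidence schedule.
# The schedule and thresholds are illustrative assumptions, not the paper's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
labeled = rng.rand(len(y)) < 0.2              # small labeled subset (~20%)
X_l, y_l = X[labeled], y[labeled]
X_u = X[~labeled]                             # unlabeled pool

# Curriculum: start by accepting only the most confident pseudo-labels,
# then relax the confidence threshold in later rounds.
for quantile in [0.9, 0.7, 0.5]:
    model = GradientBoostingClassifier(random_state=0).fit(X_l, y_l)
    proba = model.predict_proba(X_u)
    conf = proba.max(axis=1)                  # confidence = max class probability
    keep = conf >= np.quantile(conf, quantile)
    # Move confident pseudo-labeled points into the labeled set.
    X_l = np.vstack([X_l, X_u[keep]])
    y_l = np.concatenate([y_l, proba.argmax(axis=1)[keep]])
    X_u = X_u[~keep]

final_model = GradientBoostingClassifier(random_state=0).fit(X_l, y_l)
```

The paper's proposed method would additionally regularize `conf` by the likelihood of each pseudo-label, so that selected points lie in high-density regions; the plain max-probability confidence here is the baseline such regularization improves upon.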