Labeling is often expensive and constitutes a fundamental bottleneck of supervised learning. In this paper, we study the importance labeling problem, in which we are given a large pool of unlabeled data, select a limited number of points to be labeled, and then run a learning algorithm on the selected subset. We propose a new importance labeling scheme that effectively selects an informative subset of unlabeled data for least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). We analyze the generalization error of gradient descent combined with our labeling scheme and show that the proposed algorithm achieves the optimal rate of convergence in much wider settings than the usual uniform sampling scheme, and in particular attains much better generalization in the small label noise setting. Numerical experiments verify our theoretical findings.
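To make the setup concrete, the following is a minimal sketch of the pipeline described above: given an unlabeled pool, score each point's informativeness, label only a small budget of points sampled by those scores, and fit kernel least squares on the labeled subset by gradient descent. The scoring rule here (ridge leverage scores of the kernel matrix) is an illustrative assumption, not necessarily the paper's actual scheme; the RBF bandwidth, budget, and step size are likewise hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D pool: f(x) = sin(2*pi*x), labels revealed only on request.
n, budget = 200, 20
X = rng.uniform(0.0, 1.0, n)
y_true = np.sin(2 * np.pi * X)

def rbf_kernel(a, b, gamma=20.0):
    # Gaussian (RBF) kernel; gamma is an assumed bandwidth.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

K = rbf_kernel(X, X)

# Importance scores: ridge leverage scores of the kernel matrix,
# a common proxy for how informative each point is (assumption).
lam = 1e-2
scores = np.diag(K @ np.linalg.solve(K + lam * n * np.eye(n), np.eye(n)))
probs = scores / scores.sum()

# Spend the labeling budget according to the importance scores.
sel = rng.choice(n, size=budget, replace=False, p=probs)
y_sel = y_true[sel] + 0.01 * rng.standard_normal(budget)  # small label noise

# Kernel least squares on the labeled subset via gradient descent.
K_sel = K[np.ix_(sel, sel)]
alpha = np.zeros(budget)
lr = 1.0 / np.linalg.norm(K_sel, 2)  # safe step size: ||K_sel|| <= budget
for _ in range(500):
    grad = K_sel @ (K_sel @ alpha - y_sel) / budget
    alpha -= lr * grad

# Predict on the whole pool and measure error against the clean target.
y_hat = K[:, sel] @ alpha
mse = np.mean((y_hat - y_true) ** 2)
```

Uniform sampling corresponds to replacing `p=probs` with no `p` argument in `rng.choice`; the abstract's claim is that the importance-weighted choice yields better generalization, especially when the label noise is small.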