As annotations can be scarce in large-scale practical problems, leveraging unlabelled examples is one of the most important aspects of machine learning; this is the aim of semi-supervised learning. To benefit from access to unlabelled data, it is natural to smoothly diffuse knowledge from labelled data to unlabelled data, which motivates the use of Laplacian regularization. Yet current implementations of Laplacian regularization suffer from several drawbacks, notably the well-known curse of dimensionality. In this paper, we provide a statistical analysis that overcomes those issues and unveils a large body of spectral filtering methods exhibiting desirable behaviors. These methods are implemented through (reproducing) kernel methods, for which we provide realistic computational guidelines so that our approach remains usable with large amounts of data.
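To make the idea of diffusing labels through Laplacian regularization concrete, here is a minimal toy sketch (not the paper's method, and not its kernel-based spectral filtering): on a small similarity graph, we solve the standard quadratic problem that penalizes disagreement with the labelled points plus a smoothness term f^T L f, where L = D - W is the unnormalized graph Laplacian. All names and the tiny dataset are illustrative assumptions.

```python
import numpy as np

# Six points on a line; only the two endpoints are labelled (+1 / -1).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
labelled = np.array([0, 5])           # indices of labelled points
y = np.array([1.0, -1.0])             # their labels

# Gaussian similarity graph and its unnormalized Laplacian L = D - W.
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Minimize  sum_{i labelled} (f_i - y_i)^2  +  lam * f^T L f,
# a quadratic problem with closed-form solution (S + lam * L) f = S y_full,
# where S selects the labelled coordinates.
lam = 0.1
n = len(X)
S = np.zeros((n, n))
S[labelled, labelled] = 1.0
y_full = np.zeros(n)
y_full[labelled] = y
f = np.linalg.solve(S + lam * L, S @ y_full)

print(f)  # scores decay smoothly from the +1 end to the -1 end
```

The smoothness penalty forces the predicted scores on the unlabelled interior points to interpolate between the two labelled endpoints; it is precisely this naive graph-based scheme whose high-dimensional failure modes the paper analyzes and repairs via spectral filtering.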