Data driven semi-supervised learning

We consider a novel data driven approach for designing learning algorithms that can effectively learn with only a small number of labeled examples. This is crucial for modern machine learning applications where labels are scarce or expensive to obtain. We focus on graph-based techniques, where the unlabeled examples are connected in a graph under the implicit assumption that similar nodes likely have similar labels. Over the past decades, several elegant graph-based semi-supervised learning algorithms have been proposed for inferring the labels of the unlabeled examples given the graph and a few labeled examples. However, the problem of how to create the graph, which significantly impacts the practical usefulness of these methods, has been relegated to domain-specific art and heuristics, and no general principles have been proposed. In this work we present a novel data driven approach for learning the graph and provide strong formal guarantees in both the distributional and online learning formalizations. We show how to leverage problem instances coming from an underlying problem domain to learn graph hyperparameters, drawn from commonly used parametric families of graphs, that perform well on new instances coming from the same domain. We obtain low-regret, efficient algorithms in the online setting and generalization guarantees in the distributional setting. We also show how to combine several very different similarity metrics and learn multiple hyperparameters, providing general techniques that apply to large classes of problems. We expect some of the tools and techniques we develop along the way to be of interest beyond semi-supervised learning, for data driven algorithms for combinatorial problems more generally.
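To make the setting concrete, the following is a minimal sketch, not the method of this work. It assumes a Gaussian-kernel graph with a single bandwidth hyperparameter `sigma` as one example of a parametric family, uses simple iterative label propagation as the inference step, and replaces the learning procedure with a plain grid search over several synthetic problem instances. All function names, the synthetic two-cluster data, and the grid are hypothetical choices made for illustration.

```python
import math
import random


def gaussian_graph(points, sigma):
    """Dense weighted graph with w_ij = exp(-||x_i - x_j||^2 / sigma^2) (assumed family)."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / sigma ** 2)
    return W


def propagate_labels(W, labeled, iters=200):
    """Iterative label propagation; `labeled` maps node index -> 0 or 1 and stays fixed."""
    n = len(W)
    f = [float(labeled.get(i, 0.5)) for i in range(n)]
    for _ in range(iters):
        new_f = f[:]
        for i in range(n):
            if i in labeled:
                continue
            total = sum(W[i])
            if total > 0:  # isolated nodes keep their current score
                new_f[i] = sum(W[i][j] * f[j] for j in range(n)) / total
        f = new_f
    return [1 if v > 0.5 else 0 for v in f]


def make_instance(rng, n_per_class=10, n_labeled_per_class=2):
    """Hypothetical problem instance: two 2-D Gaussian clusters, few labels revealed."""
    points, labels = [], []
    for label, center in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            points.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            labels.append(label)
    labeled = {}
    for label in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == label]
        for i in rng.sample(idx, n_labeled_per_class):
            labeled[i] = label
    return points, labels, labeled


def average_error(instances, sigma):
    """Mean fraction of unlabeled nodes that are mislabeled, averaged over instances."""
    errors = []
    for points, labels, labeled in instances:
        pred = propagate_labels(gaussian_graph(points, sigma), labeled)
        unlabeled = [i for i in range(len(points)) if i not in labeled]
        errors.append(sum(pred[i] != labels[i] for i in unlabeled) / len(unlabeled))
    return sum(errors) / len(errors)


def pick_sigma(instances, grid):
    """Grid search stand-in for learning the graph hyperparameter from instances."""
    return min(grid, key=lambda s: average_error(instances, s))


if __name__ == "__main__":
    rng = random.Random(0)
    train = [make_instance(rng) for _ in range(5)]
    test = [make_instance(rng) for _ in range(5)]
    sigma = pick_sigma(train, [0.1, 0.3, 1.0, 3.0, 10.0])
    print("chosen sigma:", sigma, "test error:", average_error(test, sigma))
```

The point of the sketch is only the workflow: a hyperparameter is chosen on instances from a domain and then reused on fresh instances from the same domain. It carries none of the regret or generalization guarantees described above.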
