Data driven algorithms for limited labeled data learning

We consider a novel data driven approach for designing learning algorithms that can learn effectively from only a small number of labeled examples. This is crucial for modern machine learning applications where labels are scarce or expensive to obtain. We focus on graph-based techniques, where the unlabeled examples are connected in a graph under the implicit assumption that similar nodes likely have similar labels. Over the past decades, several elegant graph-based semi-supervised and active learning algorithms have been proposed for inferring the labels of the unlabeled examples given the graph and a few labeled examples. However, the problem of how to create the graph, which significantly affects the practical usefulness of these methods, has been relegated to domain-specific art and heuristics, and no general principles have been proposed. In this work we present a novel data driven approach for learning the graph and provide strong formal guarantees in both the distributional and online learning formalizations. We show how to leverage problem instances coming from an underlying problem domain to learn, within commonly used parametric families of graphs, hyperparameters that perform well on new instances coming from the same domain. We obtain low-regret, efficient algorithms in the online setting, and generalization guarantees in the distributional setting. We also show how to combine several very different similarity metrics and learn multiple hyperparameters, providing general techniques that apply to large classes of problems. We expect some of the tools and techniques we develop along the way to be of interest beyond semi-supervised and active learning, for data driven algorithms for combinatorial problems more generally.
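To make the setting concrete, the following is a minimal sketch of learning a single graph hyperparameter from several problem instances. It is not the algorithm analyzed in this work: it assumes a Gaussian-kernel graph family with bandwidth `sigma`, uses the standard harmonic-function method for binary label inference, and replaces the learning procedure with a plain grid search over average accuracy. All function names, the synthetic two-blob data, and the grid values are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_graph(X, sigma):
    """Dense graph with weights w_ij = exp(-||x_i - x_j||^2 / sigma^2) (assumed family)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma ** 2)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def harmonic_labels(W, labeled_idx, y_labeled):
    """Harmonic-function inference for binary labels in {0, 1}.

    Solves (D_uu - W_uu) f_u = W_ul f_l and thresholds f_u at 0.5.
    """
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ np.asarray(y_labeled, dtype=float))
    return unlabeled_idx, (f_u >= 0.5).astype(int)

def accuracy(X, y, labeled_idx, sigma):
    """Accuracy on the unlabeled nodes of one problem instance."""
    W = gaussian_graph(X, sigma)
    unlabeled_idx, pred = harmonic_labels(W, labeled_idx, y[labeled_idx])
    return float(np.mean(pred == y[unlabeled_idx]))

def learn_sigma(instances, grid):
    """Pick the grid value with the best average accuracy over the instances.

    A grid search stands in here for the learning procedure; it carries
    none of the formal guarantees discussed in the abstract.
    """
    avg = [np.mean([accuracy(X, y, idx, s) for X, y, idx in instances])
           for s in grid]
    best = int(np.argmax(avg))
    return grid[best], float(avg[best])

def sample_instance(rng, n_per_class=20, n_labeled_per_class=2):
    """Hypothetical problem domain: two Gaussian blobs, a few labels per class."""
    X = np.vstack([
        rng.normal([-2.0, 0.0], 1.0, size=(n_per_class, 2)),
        rng.normal([2.0, 0.0], 1.0, size=(n_per_class, 2)),
    ])
    y = np.repeat([0, 1], n_per_class)
    idx = np.concatenate([
        rng.choice(n_per_class, n_labeled_per_class, replace=False),
        n_per_class + rng.choice(n_per_class, n_labeled_per_class, replace=False),
    ])
    return X, y, idx

rng = np.random.default_rng(0)
grid = [0.5, 1.0, 2.0, 4.0]
train = [sample_instance(rng) for _ in range(10)]
sigma_hat, train_acc = learn_sigma(train, grid)

# Evaluate the learned bandwidth on fresh instances from the same domain.
fresh = [sample_instance(rng) for _ in range(10)]
test_acc = float(np.mean([accuracy(X, y, idx, sigma_hat) for X, y, idx in fresh]))
print(f"sigma={sigma_hat}, train acc={train_acc:.3f}, test acc={test_acc:.3f}")
```

The split into training instances and fresh instances mirrors the distributional setting described above: the hyperparameter is chosen once from sampled problems and then reused on new problems from the same domain.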
