Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates from a variety of sources. WS is theoretically well understood for binary classification, where simple approaches enable consistent estimation of pseudolabel noise rates. Using this result, it has been shown that downstream models trained on the pseudolabels have generalization guarantees nearly identical to those trained on clean labels. While this is exciting, users often wish to use WS for structured prediction, where the output space consists of more than a binary or multi-class label set: e.g. rankings, graphs, manifolds, and more. Do the favorable theoretical properties of WS for binary classification lift to this setting? We answer this question in the affirmative for a wide range of scenarios. For labels taking values in a finite metric space, we introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions, providing a nearly-consistent noise rate estimator. For labels in constant-curvature Riemannian manifolds, we introduce new invariants that also yield consistent noise rate estimation. In both cases, when using the resulting pseudolabels in concert with a flexible downstream model, we obtain generalization guarantees nearly identical to those for models trained on clean data. Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest. Empirical evaluation validates our claims and shows the merits of the proposed method.
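As an illustration of the kind of construction the abstract's "pseudo-Euclidean embeddings" refers to, the sketch below shows the classical embedding of a finite metric space into a pseudo-Euclidean space (Goldfarb-style, via eigendecomposition of the double-centered squared-distance matrix). This is a minimal, hypothetical example added for clarity, not the paper's own estimator or code; function names are my own.

```python
# Illustrative sketch (assumption: standard pseudo-Euclidean embedding of a
# finite metric space, not the authors' implementation).
import numpy as np

def pseudo_euclidean_embedding(D):
    """Embed n points with pairwise distance matrix D (n x n) so that
    d(i, j)^2 = ||x_i^+ - x_j^+||^2 - ||x_i^- - x_j^-||^2 exactly."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigval, eigvec = np.linalg.eigh(B)
    pos, neg = eigval > 1e-9, eigval < -1e-9     # signature (p, q) of the space
    X_pos = eigvec[:, pos] * np.sqrt(eigval[pos])    # "positive" coordinates
    X_neg = eigvec[:, neg] * np.sqrt(-eigval[neg])   # "negative" coordinates
    return X_pos, X_neg

# Toy usage: shortest-path distances on a 4-node star graph (center = node 0).
D = np.array([[0, 1, 1, 1],
              [1, 0, 2, 2],
              [1, 2, 0, 2],
              [1, 2, 2, 0]], dtype=float)
Xp, Xn = pseudo_euclidean_embedding(D)
d2 = np.sum((Xp[1] - Xp[2]) ** 2) - np.sum((Xn[1] - Xn[2]) ** 2)
print(np.isclose(d2, D[1, 2] ** 2))  # True: the embedding reproduces distances exactly
```

Unlike a Euclidean embedding, this construction reproduces an arbitrary finite metric exactly, which is why such embeddings are a natural tool when labels live in a general finite metric space.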