Creating labeled training sets has become one of the major roadblocks in machine learning. To address this, recent Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources. However, existing frameworks are restricted to supervision sources that share the same output space as the target task. To extend the scope of usable sources, we formulate Weak Indirect Supervision (WIS), a new research problem of automatically synthesizing training labels from indirect supervision sources whose output label spaces differ from that of the target task. To overcome the challenge of mismatched output spaces, we develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources. Moreover, we provide a theoretically principled test of whether PLRM can distinguish unseen target labels, along with a generalization bound. On image and text classification tasks as well as an industrial advertising application, we demonstrate the advantages of PLRM, outperforming baselines by 2%-9%.
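To make the WIS setting concrete, the sketch below shows how a user-provided label relation lets a source with a different (coarser) output label space cast votes on the target label space. This is only a toy illustration of the problem setup under assumed class names and a naive aggregation rule; it is not the PLRM model itself.

```python
# Toy illustration of the Weak Indirect Supervision (WIS) setting: an indirect
# source labels a coarser space, and user-provided label relations map its
# votes onto the target label space. Class names and the aggregation rule are
# hypothetical; PLRM replaces this naive rule with a probabilistic model.
from collections import defaultdict

# Target (fine-grained) label space.
TARGET_LABELS = ["husky", "beagle", "persian_cat", "sparrow"]

# User-provided label relations: coarse source label -> subsumed target labels.
LABEL_RELATIONS = {
    "dog":  {"husky", "beagle"},
    "cat":  {"persian_cat"},
    "bird": {"sparrow"},
}

def indirect_vote(coarse_label):
    """Spread one indirect source's vote uniformly over related target labels."""
    related = LABEL_RELATIONS.get(coarse_label, set())
    weight = 1.0 / len(related) if related else 0.0
    return {y: weight for y in related}

def aggregate(direct_votes, indirect_votes):
    """Naive vote aggregation over the target label space (illustrative only)."""
    scores = defaultdict(float)
    for y in direct_votes:            # direct sources already use target labels
        scores[y] += 1.0
    for coarse in indirect_votes:     # indirect sources are mapped via relations
        for y, w in indirect_vote(coarse).items():
            scores[y] += w
    return max(scores, key=scores.get) if scores else None

# One unlabeled example: a direct source says "husky";
# two indirect sources say "dog" and "cat".
print(aggregate(direct_votes=["husky"], indirect_votes=["dog", "cat"]))
# -> "husky": the coarse "dog" vote reinforces the direct fine-grained vote.
```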