In many applications, training machine learning models involves using large amounts of human-annotated data. Obtaining precise labels for the data is expensive. Instead, training with weak supervision provides a low-cost alternative. We propose a novel weak supervision algorithm that processes noisy labels, i.e., weak signals, while also considering features of the training data to produce accurate labels for training. Our method searches over classifiers of the data representation to find plausible labelings. We call this paradigm data consistent weak supervision. A key facet of our framework is that we are able to estimate labels for data examples with low or no coverage from the weak supervision. In addition, we make no assumptions about the joint distribution of the weak signals and true labels of the data. Instead, we use the weak signals and the data features to solve a constrained optimization that enforces data consistency among the labels we generate. Empirical evaluation of our method on different datasets shows that it significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.
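To make the idea concrete, here is a minimal toy sketch, not the paper's actual algorithm (which solves a constrained optimization over classifiers): it starts from the average of non-abstaining weak-signal votes and alternates between fitting a simple logistic model on the data features and blending its predictions back into the label estimates, so that examples with no weak-signal coverage inherit labels from the feature-based model. The data, the weak-signal noise and coverage rates, and the update rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: 200 points in 2-D, true label = sign of the first feature.
n = 200
X = rng.normal(size=(n, 2))
y_true = (X[:, 0] > 0).astype(float)

# Synthetic weak signals: noisy votes in {0, 1}, or -1 when they abstain.
# flip_p is the label-noise rate, cover_p the fraction of examples covered.
def weak_signal(y, flip_p, cover_p):
    votes = np.where(rng.random(n) < flip_p, 1.0 - y, y)
    votes[rng.random(n) > cover_p] = -1.0  # abstain
    return votes

signals = np.stack([weak_signal(y_true, 0.20, 0.6),
                    weak_signal(y_true, 0.30, 0.5),
                    weak_signal(y_true, 0.25, 0.4)])

# Initial soft labels: average of non-abstaining votes (0.5 if uncovered).
mask = signals >= 0
counts = mask.sum(axis=0)
votes_avg = np.where(counts > 0,
                     np.where(mask, signals, 0.0).sum(axis=0)
                     / np.maximum(counts, 1),
                     0.5)
y_soft = votes_avg.copy()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Alternate: fit a logistic model to the current soft labels, then blend its
# predictions with the weak-signal votes; uncovered examples fall back
# entirely on the model, which is how they receive label estimates at all.
w, b = np.zeros(2), 0.0
for _ in range(5):
    for _ in range(200):              # gradient steps on the logistic loss
        p = sigmoid(X @ w + b)
        g = p - y_soft
        w -= 0.1 * (X.T @ g) / n
        b -= 0.1 * g.mean()
    p = sigmoid(X @ w + b)
    y_soft = np.where(counts > 0, 0.5 * votes_avg + 0.5 * p, p)

acc = float(((y_soft > 0.5) == (y_true > 0.5)).mean())
print(f"estimated-label accuracy: {acc:.2f}")
```

Because the logistic model is fit on the features, its predictions extrapolate to the examples no weak signal covers, illustrating the abstract's point that feature information lets one label examples outside the weak signals' coverage.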