Finding relevant, high-quality datasets to train machine learning models is a major bottleneck for practitioners. Moreover, ambitious real-world use cases usually require that the data come with high-quality annotations to support training a supervised model, and manually producing such labels is a time-consuming, challenging task that often becomes the bottleneck of a machine learning project. Weakly Supervised Learning (WSL) approaches alleviate this annotation burden by automatically assigning approximate labels (pseudo-labels) to unlabelled data based on heuristics, distant supervision, and knowledge bases. We apply probabilistic generative latent variable models (PLVMs), trained on heuristic labelling representations of the original dataset, as an accurate, fast, and cost-effective way to generate pseudo-labels. We show that PLVMs achieve state-of-the-art performance across four datasets; for example, they attain an F1 score 22 percentage points higher than Snorkel's on the class-imbalanced Spouse dataset. PLVMs are plug-and-play: they serve as drop-in replacements for existing WSL frameworks (e.g. Snorkel) or as benchmark models for more complicated algorithms, giving practitioners a compelling accuracy boost.
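The pipeline the abstract describes (heuristic labelling outputs fed to a generative latent variable model that emits pseudo-labels) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual PLVM: it uses synthetic labelling-function votes in {-1, 0, +1} (negative / abstain / positive) and scikit-learn's `GaussianMixture` as a stand-in latent variable model whose posterior over components supplies soft pseudo-labels.

```python
# Hedged sketch of WSL pseudo-labelling with a latent variable model.
# The vote matrix, 70% accuracy rate, and GaussianMixture choice are
# all illustrative assumptions, not details taken from the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic example: 200 unlabelled items, 5 heuristic labelling functions.
true_y = rng.integers(0, 2, size=200)  # hidden ground truth (unseen in WSL)
fires = rng.random((200, 5)) < 0.7     # each labelling function fires 70% of the time
votes = np.where(fires, np.where(true_y[:, None] == 1, 1, -1), 0)

# Fit a two-component mixture over the vote vectors; each latent
# component is interpreted as one class.
gm = GaussianMixture(n_components=2, random_state=0).fit(votes)
posterior = gm.predict_proba(votes)    # soft pseudo-labels for downstream training
pseudo = posterior.argmax(axis=1)      # hard pseudo-labels

# Mixture components are unordered, so align component 1 with the
# class whose items carry a higher net positive vote.
if votes[pseudo == 1].sum() < votes[pseudo == 0].sum():
    pseudo = 1 - pseudo
```

The soft `posterior` (rather than the hard `pseudo` labels) would typically be passed to a noise-aware discriminative model, which is also how frameworks such as Snorkel consume their label model's output.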