Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier, and it is simple to plug into existing weak supervision pipelines, requiring just a few lines of code. We show that our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
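To make the "few lines of code" claim concrete, here is a minimal sketch of cut-statistic subset selection over pretrained embeddings. It assumes a k-NN graph with uniform edge weights and a binomial null model for the cut-edge count; the function name `cut_statistic_select` and the parameters `k` and `keep_frac` are illustrative choices, not the paper's API.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_select(embeddings, weak_labels, k=20, keep_frac=0.5):
    """Rank weakly-labeled examples by the cut statistic and keep, per class,
    the fraction whose neighborhoods agree most with their weak label."""
    X = np.asarray(embeddings)
    y = np.asarray(weak_labels)
    n = len(y)

    # k-NN graph in the pretrained representation space; request k+1
    # neighbors because the nearest neighbor of a training point is itself.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    idx = idx[:, 1:]  # drop the self-edge

    # J_i: number of cut edges, i.e. neighbors whose weak label disagrees.
    cut = (y[idx] != y[:, None]).sum(axis=1)

    # Null model: neighbor labels drawn i.i.d. from the empirical class
    # prior, so J_i ~ Binomial(k, 1 - p_{y_i}); standardize to a z-score.
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    p_i = np.array([prior[c] for c in y])
    mu = k * (1.0 - p_i)
    sigma = np.sqrt(np.maximum(k * p_i * (1.0 - p_i), 1e-12))
    z = (cut - mu) / sigma

    # Low z means a smooth neighborhood: keep the lowest-scoring fraction
    # of each class so the selected subset stays class-balanced.
    keep = []
    for c in classes:
        members = np.flatnonzero(y == c)
        n_keep = max(1, int(keep_frac * len(members)))
        keep.extend(members[np.argsort(z[members])[:n_keep]])
    return np.sort(np.asarray(keep))
```

In a weak supervision pipeline, a step like this would sit between the label model and the end classifier: score the weakly-labeled examples, then train the end model only on the returned subset of indices.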