A critical bottleneck in supervised machine learning is the need for large amounts of labeled data, which is expensive and time-consuming to obtain. However, it has been shown that a small amount of labeled data, while insufficient to re-train a model, can be effectively used to generate human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data, in a paradigm now commonly referred to as data programming. However, previous approaches to automatically generating LFs make no attempt to further use the given labeled data for model training, thus giving up opportunities for improved performance. Moreover, since the LFs are generated from a relatively small labeled dataset, they are prone to noise, and naively aggregating them can lead to very poor performance in practice. In this work, we propose an LF-based reweighting framework, \ouralgo{}, to address these two critical limitations. Our algorithm learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner and, more critically, reweights each LF according to its goodness, influencing its contribution to the semi-supervised loss via a robust bi-level optimization algorithm. We show that our algorithm significantly outperforms prior approaches on several text classification datasets.
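To make the bi-level reweighting idea concrete, the following is a minimal sketch in JAX, not the authors' implementation: it assumes a linear scorer, binary labels in {-1, +1}, pre-computed LF votes in {-1, 0, +1} (0 = abstain), a hinge-style agreement loss on unlabeled data, and a standard one-step inner update to approximate the inner optimization. All function and variable names are illustrative.

```python
# Sketch of bi-level LF reweighting (illustrative; assumptions noted above).
import jax
import jax.numpy as jnp

def model_logits(theta, X):
    # Simple linear scorer standing in for the real model.
    return X @ theta

def weighted_lf_loss(theta, w, X_u, lf_votes):
    # Semi-supervised loss on unlabeled data: each LF's pseudo-label loss
    # is scaled by its (softmax-normalized) weight; abstentions contribute 0.
    logits = model_logits(theta, X_u)                                   # (n_u,)
    mask = jnp.abs(lf_votes)                                            # 0 where LF abstains
    per_lf = mask * jnp.maximum(0.0, 1.0 - lf_votes * logits[:, None])  # (n_u, n_lfs)
    return jnp.mean(per_lf @ jax.nn.softmax(w))

def clean_loss(theta, X_l, y_l):
    # Supervised logistic loss on the small labeled set.
    return jnp.mean(jnp.log1p(jnp.exp(-y_l * model_logits(theta, X_l))))

def outer_objective(w, theta, X_u, lf_votes, X_l, y_l, inner_lr=0.1):
    # Inner level: one gradient step of the model on the LF-weighted loss.
    g = jax.grad(weighted_lf_loss)(theta, w, X_u, lf_votes)
    theta_new = theta - inner_lr * g
    # Outer level: evaluate the updated model on the clean labeled data.
    return clean_loss(theta_new, X_l, y_l)

# Toy data and one outer update of the LF weights.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
X_u = jax.random.normal(k1, (100, 5))
X_l = jax.random.normal(k2, (20, 5))
y_l = jnp.sign(X_l[:, 0])
lf_votes = jnp.sign(jax.random.normal(k3, (100, 4)))
theta, w = jnp.zeros(5), jnp.zeros(4)

w = w - 0.5 * jax.grad(outer_objective)(w, theta, X_u, lf_votes, X_l, y_l)
```

A single outer gradient step down-weights LFs whose pseudo-labels hurt performance on the clean labeled set; in a full training loop this weight update would alternate with updates to the model parameters.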