Recent Weak Supervision (WS) approaches have had widespread success in easing the bottleneck of labeling training data for machine learning by synthesizing labels from multiple potentially noisy supervision sources. However, proper measurement and analysis of these approaches remain a challenge. First, datasets used in existing works are often private and/or custom, limiting standardization. Second, WS datasets with the same name and base data often vary in terms of the labels and weak supervision sources used, a significant "hidden" source of evaluation variance. Finally, WS studies often diverge in terms of the evaluation protocol and ablations used. To address these problems, we introduce a benchmark platform, WRENCH, for thorough and standardized evaluation of WS approaches. It consists of 22 varied real-world datasets for classification and sequence tagging; a range of real, synthetic, and procedurally-generated weak supervision sources; and a modular, extensible framework for WS evaluation, including implementations for popular WS methods. We use WRENCH to conduct extensive comparisons over more than 120 method variants to demonstrate its efficacy as a benchmark platform. The code is available at https://github.com/JieyuZ2/wrench.
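To make the label-synthesis step concrete, the sketch below aggregates votes from multiple noisy supervision sources by majority vote, the simplest WS baseline. This is an illustrative example only, not WRENCH's API; the abstain convention and the tie-breaking fallback are assumptions for this sketch.

```python
import numpy as np

ABSTAIN = -1  # assumed convention: a source that does not fire on an example abstains

def majority_vote(L: np.ndarray, n_classes: int) -> np.ndarray:
    """Aggregate a label matrix L (n_examples x n_sources) with entries in
    {ABSTAIN, 0, ..., n_classes-1} into one label per example by majority
    vote, ignoring abstentions; all-abstain rows fall back to class 0."""
    labels = np.zeros(L.shape[0], dtype=int)
    for i, row in enumerate(L):
        votes = row[row != ABSTAIN]
        if votes.size:
            counts = np.bincount(votes, minlength=n_classes)
            labels[i] = counts.argmax()
    return labels

# Three noisy sources voting on four examples of a binary task.
L = np.array([
    [1, 1, ABSTAIN],
    [0, ABSTAIN, 0],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [1, 0, 1],
])
print(majority_vote(L, n_classes=2))  # -> [1 0 0 1]
```

More sophisticated label models replace this unweighted vote with estimates of each source's accuracy and correlations, which is exactly the axis along which benchmarked WS methods differ.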