Aggregating multiple sources of weak supervision (WS) can ease the data-labeling bottleneck prevalent in many machine learning applications by replacing the tedious manual collection of ground-truth labels. Current state-of-the-art approaches that do not use any labeled training data, however, require two separate modeling steps: learning a probabilistic latent variable model based on the WS sources, which relies on assumptions that rarely hold in practice, followed by downstream model training. Importantly, the first modeling step does not consider the performance of the downstream model. To address these limitations, we propose an end-to-end approach that directly learns the downstream model by maximizing its agreement with probabilistic labels, which are generated by reparameterizing previous probabilistic posteriors with a neural network. Our results show improvements over prior work both in end-model performance on downstream test sets and in robustness to dependencies among weak supervision sources.
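The core idea of the abstract, scoring a downstream model by its agreement with probabilistic labels aggregated from weak sources, can be illustrated with a small forward-pass sketch. This is a hedged illustration under stated assumptions, not the authors' implementation: the function names, the vote encoding (`None` for an abstaining source), the additive per-source weighting, and the symmetric cross-entropy used as the agreement measure are all hypothetical choices. In an end-to-end setting the per-source weights would be produced by a neural network and trained jointly with the downstream model; here they are fixed numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probabilistic_label(votes, weights, num_classes):
    """Aggregate weak-source votes into a soft label.

    votes[j]   : class index chosen by source j, or None if it abstains.
    weights[j] : hypothetical per-example score for source j (assumption:
                 in an end-to-end setup a neural network would output these).
    """
    scores = [0.0] * num_classes
    for vote, weight in zip(votes, weights):
        if vote is not None:
            scores[vote] += weight
    return softmax(scores)


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) for two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def agreement_loss(soft_label, model_probs):
    """Symmetric cross-entropy: lower means the two distributions agree more.

    Assumption: a training implementation would typically treat the target
    side of each term as a constant (stop-gradient); this sketch only
    computes the loss value.
    """
    return (cross_entropy(soft_label, model_probs)
            + cross_entropy(model_probs, soft_label))


# Four hypothetical weak sources voting on a binary task; the third abstains.
votes = [1, 1, None, 0]
weights = [2.0, 1.0, 0.5, 0.5]
soft_label = probabilistic_label(votes, weights, num_classes=2)

# A downstream prediction that agrees with the soft label incurs a lower
# loss than one that contradicts it.
loss_agree = agreement_loss(soft_label, [0.1, 0.9])
loss_disagree = agreement_loss(soft_label, [0.9, 0.1])
print(soft_label, loss_agree, loss_disagree)
```

Minimizing such a loss with respect to both the weighting network and the downstream model is what would make the procedure end-to-end, in contrast to first fitting a separate latent variable model and then training on its fixed outputs.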