Many promising applications of supervised machine learning are held back by the difficulty of acquiring labeled data in sufficient quantity and quality, which creates an expensive bottleneck. To overcome this limitation, techniques that do not depend on ground-truth labels have been studied, including weak supervision and generative modeling. While these techniques would seem to be usable in concert, each improving the other, how to build an interface between them is not well understood. In this work, we propose a model that fuses programmatic weak supervision with generative adversarial networks, and we provide theoretical justification motivating this fusion. The proposed approach captures discrete latent variables in the data alongside the label estimate derived from weak supervision. Aligning the two allows for better modeling of the sample-dependent accuracies of the weak supervision sources, which improves the estimate of the unobserved labels. The approach is the first to enable data augmentation through weakly supervised synthetic images and pseudolabels. Additionally, its learned latent variables can be inspected qualitatively. The model outperforms baseline weak supervision label models on a number of multiclass image classification datasets, improves the quality of generated images, and further improves end-model performance through data augmentation with synthetic samples.