Obtaining large annotated datasets is critical for training successful machine learning models, and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels from multiple noisy heuristics. This process can scale to large datasets and has demonstrated state-of-the-art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their creation requires creativity, foresight, and domain expertise from those who hand-craft them, a process that can be tedious and subjective. We develop the first framework for interactive weak supervision, in which a method proposes heuristics and learns from user feedback given on each proposed heuristic. Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles.