The field of Weakly Supervised Learning (WSL) has recently seen a surge in popularity, with numerous papers addressing different types of "supervision deficiencies", namely: poor quality, non-adaptability, and insufficient quantity of labels. Regarding quality, label noise can be of different types, including completely-at-random, at-random or even not-at-random. These kinds of label noise are addressed separately in the literature, leading to highly specialized approaches. This paper proposes an original, encompassing view of Weakly Supervised Learning, which results in the design of generic approaches capable of dealing with any kind of label noise. For this purpose, an alternative setting called "Biquality data" is used. It assumes that a small trusted dataset of correctly labeled examples is available, in addition to an untrusted dataset of noisy examples. In this paper, we propose a new reweighting scheme capable of identifying non-corrupted examples in the untrusted dataset. This allows one to learn classifiers using both datasets. Extensive experiments that simulate several types of label noise and vary the quality and quantity of untrusted examples demonstrate that the proposed approach outperforms baselines and state-of-the-art approaches.
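The abstract does not spell out the reweighting scheme, so the following is only a minimal sketch of one way the Biquality idea could be illustrated, not the paper's actual method. It assumes discrete features and weights each untrusted example by the ratio of two smoothed estimates of P(y | x), one fitted on the trusted set and one on the untrusted set. The function names `conditional_label_probs` and `biquality_weights`, the smoothing parameter `alpha`, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def conditional_label_probs(samples, labels, alpha=1.0):
    """Return a function estimating P(y | x) from (x, y) pairs.

    Assumes x is discrete (hashable). Uses Laplace smoothing with
    pseudo-count `alpha` so that unseen (x, y) pairs get non-zero mass.
    """
    counts = defaultdict(Counter)
    for x, y in samples:
        counts[x][y] += 1
    n_labels = len(labels)

    def prob(x, y):
        label_counts = counts.get(x, Counter())
        total = sum(label_counts.values())
        return (label_counts[y] + alpha) / (total + alpha * n_labels)

    return prob


def biquality_weights(trusted, untrusted, labels, alpha=1.0):
    """Weight each untrusted example by P_trusted(y | x) / P_untrusted(y | x).

    Illustrative assumption: a label that the trusted data finds unlikely
    for a given x, but that is common in the untrusted data, gets a low weight.
    """
    p_trusted = conditional_label_probs(trusted, labels, alpha)
    p_untrusted = conditional_label_probs(untrusted, labels, alpha)
    return [p_trusted(x, y) / p_untrusted(x, y) for x, y in untrusted]


# Toy data: feature "a" truly maps to label 0, feature "b" to label 1.
trusted = [("a", 0)] * 4 + [("b", 1)] * 4
# Untrusted set: two of the "a" examples carry a flipped label.
untrusted = [("a", 0)] * 6 + [("a", 1)] * 2 + [("b", 1)] * 8

weights = biquality_weights(trusted, untrusted, labels=[0, 1])
clean_weight = weights[0]    # an ("a", 0) example
flipped_weight = weights[6]  # an ("a", 1) example
```

In this toy setting the flipped examples receive a smaller weight than the clean ones. Such weights could then be passed as per-sample weights to any learner that accepts them, so that the final classifier is trained on both datasets.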