Binary Classification from Multiple Unlabeled Datasets via Surrogate Set Classification

To cope with high annotation costs, training a classifier from only weakly supervised data has attracted a great deal of attention. Among various approaches, strengthening supervision from completely unsupervised classification is a promising direction: it typically employs class priors as the only supervision and trains a binary classifier from unlabeled (U) datasets. While existing risk-consistent methods are theoretically grounded and highly flexible, they can learn only from two U sets. In this paper, we propose a new approach for binary classification from $m$ U sets for $m\ge2$. Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC), which aims to predict from which U set each observed data point is drawn. SSC can be solved by a standard (multi-class) classification method, and we use the SSC solution to obtain the final binary classifier through a certain linear-fractional transformation. We build our method in a flexible and efficient end-to-end deep learning framework and prove it to be classifier-consistent. Through experiments, we demonstrate the superiority of our proposed method over state-of-the-art methods.
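To make the link between the binary task and SSC concrete, here is a minimal sketch of one way a linear-fractional map can arise. It is an illustration under stated assumptions, not necessarily the exact transformation used in the paper. It assumes that each U set $j$ is a mixture of the same two class-conditional densities with a known class prior $\theta_j$, that set $j$ contributes a fraction $\rho_j$ of all data, and that the target class prior is $\pi$. Under these assumptions, Bayes' rule gives the set posterior as a ratio of two functions that are linear in the class posterior $\eta(x) = P(y=+1 \mid x)$. The names `set_posteriors`, `thetas`, `rhos`, and `pi` are hypothetical.

```python
import numpy as np

def set_posteriors(eta, thetas, rhos, pi):
    """Map class posteriors eta = P(y=+1 | x) to set posteriors P(set=j | x).

    Assumption (not taken from the abstract): U set j has density
    theta_j * p_plus(x) + (1 - theta_j) * p_minus(x), and the target
    marginal is pi * p_plus(x) + (1 - pi) * p_minus(x).
    """
    eta = np.asarray(eta, dtype=float)[..., None]   # shape (..., 1)
    thetas = np.asarray(thetas, dtype=float)        # shape (m,)
    rhos = np.asarray(rhos, dtype=float)            # shape (m,)

    # Numerator for set j is linear in eta: a_j * eta + b_j.
    a = rhos * (thetas - pi)
    b = rhos * pi * (1.0 - thetas)
    num = a * eta + b                               # shape (..., m)

    # The denominator is the sum of the numerators, so it is also linear
    # in eta; the whole map is therefore linear-fractional in eta.
    return num / num.sum(axis=-1, keepdims=True)

# Toy usage: three U sets with different (assumed known) class priors.
q = set_posteriors(eta=[0.0, 0.5, 1.0],
                   thetas=[0.2, 0.5, 0.9],
                   rhos=[1 / 3, 1 / 3, 1 / 3],
                   pi=0.5)
```

In an end-to-end setup, a map of this kind could be placed on top of a binary model's sigmoid output and trained with an ordinary multi-class cross-entropy loss against the set index, so that the underlying binary classifier is learned through the surrogate task.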
