Binary Classification from Multiple Unlabeled Datasets via Surrogate Set Classification

To cope with high annotation costs, training a classifier from only weakly supervised data has attracted a great deal of attention. Among various approaches, strengthening supervision from completely unsupervised classification is a promising direction; it typically employs class priors as the only supervision and trains a binary classifier from unlabeled (U) datasets. While existing risk-consistent methods are theoretically grounded and highly flexible, they can learn from only two U sets. In this paper, we propose a new approach for binary classification from $m$ U sets for $m\ge2$. Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC), which aims to predict from which U set each observed data point is drawn. SSC can be solved by a standard (multi-class) classification method, and we use the SSC solution to obtain the final binary classifier through a certain linear-fractional transformation. We build our method in a flexible and efficient end-to-end deep learning framework and prove it to be classifier-consistent. Through experiments, we demonstrate the superiority of our proposed method over state-of-the-art methods.
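The abstract does not spell out the linear-fractional transformation, so the following is only a minimal sketch of one way such a link can arise, not the paper's implementation. It assumes that each U set $j$ is a mixture of the same two class-conditional densities with a known class prior $\theta_j$, that a fraction $\rho_j$ of the training points comes from set $j$, and that the test class prior is $\pi$. Under these assumptions, Bayes' rule gives the probability that a point $x$ came from set $j$ as a ratio of two functions that are linear in the binary posterior $\eta(x) = p(y=+1 \mid x)$. The function name `set_posteriors` and all parameter names are hypothetical.

```python
def set_posteriors(eta, thetas, rhos, pi):
    """Map a binary posterior eta = p(y=+1 | x) to surrogate-set posteriors.

    Illustrative assumptions (not taken from the abstract):
      thetas[j]: class prior of U set j
      rhos[j]:   fraction of all training points drawn from U set j
      pi:        class prior of the test distribution, with 0 < pi < 1
    """
    numerators = []
    for theta, rho in zip(thetas, rhos):
        # Density ratio p_j(x) / p(x) is linear in eta: a * eta + b (scaled by rho).
        a = rho * (theta / pi - (1.0 - theta) / (1.0 - pi))
        b = rho * (1.0 - theta) / (1.0 - pi)
        numerators.append(a * eta + b)
    # The shared denominator is the sum of the numerators, also linear in eta,
    # so each output is a linear-fractional function of eta.
    total = sum(numerators)
    return [n / total for n in numerators]


# Three hypothetical U sets with different class priors and equal sizes.
thetas = [0.9, 0.5, 0.1]
rhos = [1 / 3, 1 / 3, 1 / 3]
probs = set_posteriors(eta=0.8, thetas=thetas, rhos=rhos, pi=0.5)
```

In a training setup along the lines the abstract describes, a binary model would output $\eta(x)$, this mapping would turn it into $m$ set probabilities, and a standard multi-class loss against the set index would supply the training signal. A point that is likely positive is assigned the highest probability for the set with the largest class prior.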
