From two unlabeled (U) datasets with different class priors, we can train a binary classifier by empirical risk minimization, which is called UU classification. It is promising since UU methods are compatible with any neural network (NN) architecture and optimizer, as if it were standard supervised classification. In this paper, however, we find that UU methods may suffer from severe overfitting, and that this overfitting co-occurs strongly with the empirical risk going negative, regardless of the dataset, NN architecture, and optimizer. Hence, to mitigate the overfitting problem of UU methods, we propose to keep two parts of the empirical risk (i.e., the false-positive and false-negative parts) non-negative by wrapping them in a family of correction functions. We theoretically show that the corrected risk estimator is still asymptotically unbiased and consistent; furthermore, we establish an estimation error bound for the corrected risk minimizer. Experiments with feedforward/residual NNs on standard benchmarks demonstrate that our proposed correction can successfully mitigate the overfitting of UU methods and significantly improve the classification accuracy.
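To make the correction concrete, below is a minimal PyTorch sketch of one plausible way to implement a corrected UU risk of this kind. The particular rewriting of the risk into two parts, the function name corrected_uu_risk, the priors pi, theta1, theta2 (assumed known, with theta1 > theta2), and the choice of the sigmoid surrogate loss with ReLU as the correction function are illustrative assumptions, not the paper's reference implementation.

    import torch
    import torch.nn.functional as F

    def sigmoid_loss(z, y):
        # Sigmoid surrogate loss l(z, y) = sigmoid(-y * z) for labels y in {+1, -1}.
        return torch.sigmoid(-y * z)

    def corrected_uu_risk(g_u1, g_u2, pi, theta1, theta2,
                          loss=sigmoid_loss, correction=F.relu):
        # A sketch of a non-negative corrected UU risk (assumed form, not the authors' code).
        #   g_u1, g_u2       : classifier outputs g(x) on the two unlabeled sets
        #   pi               : test class prior p(y = +1), assumed known
        #   theta1, theta2   : class priors of the two unlabeled sets, theta1 > theta2
        #   correction       : correction function f, e.g. F.relu or torch.abs
        scale = theta1 - theta2
        # Rewrite the risk in terms of the two U distributions; each part estimates a
        # non-negative quantity (pi * R_+(g) and (1 - pi) * R_-(g), respectively).
        pos_part = (pi / scale) * ((1 - theta2) * loss(g_u1, 1).mean()
                                   - (1 - theta1) * loss(g_u2, 1).mean())
        neg_part = ((1 - pi) / scale) * (theta1 * loss(g_u2, -1).mean()
                                         - theta2 * loss(g_u1, -1).mean())
        # Wrapping both parts in the correction function keeps the empirical estimate
        # non-negative, which is what curbs the overfitting described above.
        return correction(pos_part) + correction(neg_part)

Because the resulting risk is an ordinary differentiable scalar, it can be minimized with any NN architecture and optimizer, e.g. risk = corrected_uu_risk(model(x_u1), model(x_u2), pi, theta1, theta2) followed by risk.backward().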