Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced classification. Various supervised AUC optimization methods have been developed, and they have also been extended to semi-supervised scenarios to cope with small-sample problems. However, existing semi-supervised AUC optimization methods rely on strong distributional assumptions, which are rarely satisfied in real-world problems. In this paper, we propose a novel semi-supervised AUC optimization method that does not require such restrictive assumptions. We first develop an AUC optimization method based only on positive and unlabeled data (PU-AUC) and then extend it to semi-supervised learning by combining it with a supervised AUC optimization method. We theoretically prove that, without the restrictive distributional assumptions, unlabeled data contribute to improving the generalization performance of the PU and semi-supervised AUC optimization methods. Finally, we demonstrate the practical usefulness of the proposed methods through experiments.
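To make the PU setting concrete, the following is a minimal sketch of how an AUC value can be estimated from positive and unlabeled scores alone. It is an illustration under stated assumptions, not necessarily the estimator proposed in the paper: it assumes the unlabeled data are drawn from a mixture of the positive and negative class distributions and that the positive class prior (`prior_pos`, a hypothetical parameter name) is known. Under those assumptions, the positive-versus-unlabeled AUC equals `prior_pos * 0.5 + (1 - prior_pos) * AUC`, which can be solved for the positive-versus-negative AUC. The function names are hypothetical.

```python
import numpy as np


def pairwise_auc(scores_a, scores_b):
    """Fraction of pairs (a, b) with score_a > score_b; ties count as 1/2."""
    a = np.asarray(scores_a, dtype=float)[:, None]
    b = np.asarray(scores_b, dtype=float)[None, :]
    return float(np.mean((a > b) + 0.5 * (a == b)))


def pu_auc_estimate(scores_pos, scores_unl, prior_pos):
    """Estimate positive-vs-negative AUC from positive and unlabeled scores.

    Assumption (not taken from the abstract): the unlabeled scores come from
    the mixture prior_pos * P + (1 - prior_pos) * N with prior_pos known.
    """
    if not 0.0 <= prior_pos < 1.0:
        raise ValueError("prior_pos must lie in [0, 1)")
    auc_pu = pairwise_auc(scores_pos, scores_unl)
    # Positives hidden in the unlabeled set contribute exactly 1/2 in
    # expectation, so subtract that share and rescale by the negative prior.
    return (auc_pu - 0.5 * prior_pos) / (1.0 - prior_pos)


if __name__ == "__main__":
    pos = [0.9, 0.8, 0.4]
    neg = [0.7, 0.3, 0.2, 0.1]
    unl = pos + neg  # unlabeled pool mixing both classes
    print(pairwise_auc(pos, neg))
    print(pu_auc_estimate(pos, unl, prior_pos=len(pos) / len(unl)))
```

In this toy example the unlabeled pool is built from the same positive and negative scores, so the PU estimate matches the supervised AUC up to floating-point error. With real data, the quality of the estimate depends on how accurately the class prior is known.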