A bottleneck of binary classification from positive and unlabeled data (PU classification) is the requirement that the given unlabeled patterns be drawn from the same distribution as the test data. In practice, this requirement is often not fulfilled. In this paper, we generalize PU classification to the class prior shift scenario, where the class prior of the given unlabeled patterns differs from that of the test unlabeled patterns. Based on an analysis of the Bayes optimal classifier, we show that, given a test class prior, PU classification under class prior shift is equivalent to PU classification with asymmetric error, where the false positive penalty and the false negative penalty are different. We then propose two frameworks to handle these problems: a risk minimization framework and a density ratio estimation framework. Finally, we demonstrate the effectiveness of the proposed frameworks and compare them through experiments on benchmark datasets.
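The link between class prior shift and asymmetric error can be illustrated with the standard Bayes-rule argument. The sketch below is not necessarily the paper's exact formulation. It assumes that only the class prior changes between training and test while the class-conditional densities stay the same, and that an estimate of the training-distribution posterior p(y=+1|x) is already available (in the PU setting, this would itself have to be estimated from positive and unlabeled data). Under these assumptions, predicting with the test-optimal rule is the same as weighting false negatives by `test_prior / train_prior` and false positives by `(1 - test_prior) / (1 - train_prior)`. The function and parameter names are hypothetical.

```python
def shifted_threshold(train_prior: float, test_prior: float) -> float:
    """Threshold on the training posterior p(y=+1|x) that is Bayes optimal
    under the test class prior, assuming shared class-conditional densities.

    Hypothetical helper for illustration only.
    """
    if not (0.0 < train_prior < 1.0 and 0.0 < test_prior < 1.0):
        raise ValueError("class priors must lie strictly between 0 and 1")
    # Asymmetric penalties induced by the prior shift.
    fn_weight = test_prior / train_prior                  # cost of a false negative
    fp_weight = (1.0 - test_prior) / (1.0 - train_prior)  # cost of a false positive
    # Predict positive iff fn_weight * p > fp_weight * (1 - p),
    # i.e. iff p > fp_weight / (fn_weight + fp_weight).
    return fp_weight / (fn_weight + fp_weight)


def predict_positive(train_posterior: float,
                     train_prior: float,
                     test_prior: float) -> bool:
    """Classify one point from its (estimated) training posterior."""
    return train_posterior > shifted_threshold(train_prior, test_prior)


if __name__ == "__main__":
    # No shift: the usual symmetric threshold is recovered.
    print(shifted_threshold(0.5, 0.5))
    # Positives are more common at test time, so the threshold drops.
    print(shifted_threshold(0.5, 0.8))
```

When the two priors coincide, both weights equal 1 and the rule reduces to the ordinary symmetric threshold of 0.5. A larger test prior lowers the threshold, which corresponds to penalizing false negatives more heavily than false positives.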