Positive-unlabeled (PU) learning addresses binary classification problems in which only positive (P) and unlabeled (U) data are available. Many PU methods based on linear models and neural networks have been proposed; however, how theoretically sound boosting-style algorithms can work with P and U data remains largely unstudied. Since in some scenarios neural networks cannot perform as well as boosting algorithms even with fully supervised data, we propose a novel boosting algorithm for PU learning, Ada-PU, and compare it against neural networks. Ada-PU follows the general procedure of AdaBoost while maintaining and updating two different distributions over the P data. After a weak classifier is learned on the newly updated distribution, the corresponding combination weight for the final ensemble is estimated using only PU data. We show that, with a smaller set of base classifiers, the proposed method is guaranteed to preserve the theoretical properties of boosting algorithms. In experiments, Ada-PU outperforms neural networks on benchmark PU datasets. We also study UNSW-NB15, a real-world dataset in cyber security, and demonstrate that Ada-PU achieves superior performance for malicious activity detection.
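To make the idea concrete, the following is a minimal, illustrative sketch of an AdaBoost-style loop driven only by P and U data. It is not the paper's Ada-PU algorithm: for simplicity it uses 1-D decision stumps, maintains one boosting distribution over P and one over U (rather than the paper's two distributions over P), and selects each stump by a weighted version of the standard unbiased PU risk estimate for the 0-1 loss, R(h) = 2π E_P[1(h(x) = -1)] + E_U[1(h(x) = +1)] - π, where π is an assumed class prior. The function names and all hyperparameters are hypothetical.

```python
import numpy as np

def stump_predict(x, threshold, sign):
    # 1-D decision stump: sign * (+1 if x > threshold else -1)
    return sign * np.where(x > threshold, 1.0, -1.0)

def ada_pu_sketch(xp, xu, pi, T=10):
    """Toy AdaBoost-style loop on positive (xp) and unlabeled (xu) 1-D data.

    pi is an assumed class prior P(y=+1). Illustrative sketch only; the
    actual Ada-PU algorithm differs in how distributions and weights are set.
    """
    wp = np.full(len(xp), 1.0 / len(xp))  # boosting distribution over P
    wu = np.full(len(xu), 1.0 / len(xu))  # boosting distribution over U
    ensemble = []
    for _ in range(T):
        # Pick the stump minimizing a weighted PU risk estimate:
        # risk = 2*pi*E_P[h = -1] + E_U[h = +1] - pi  (0-1 loss, PU data only)
        best = None
        for thr in np.unique(np.concatenate([xp, xu])):
            for sign in (1.0, -1.0):
                err_p = np.sum(wp[stump_predict(xp, thr, sign) < 0])
                err_u = np.sum(wu[stump_predict(xu, thr, sign) > 0])
                risk = 2 * pi * err_p + err_u - pi
                if best is None or risk < best[0]:
                    best = (risk, thr, sign)
        risk, thr, sign = best
        # Clip so the combining weight stays finite (risk can go negative
        # because the PU estimate is unbiased, not nonnegative).
        eps = min(max(risk, 0.01), 0.99)
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, thr, sign))
        # Reweight: up-weight P examples the stump called negative,
        # up-weight U examples the stump called positive.
        wp *= np.exp(-alpha * stump_predict(xp, thr, sign))
        wu *= np.exp(alpha * stump_predict(xu, thr, sign))
        wp /= wp.sum()
        wu /= wu.sum()

    def predict(x):
        score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
        return np.where(score > 0, 1, -1)
    return predict
```

On a tiny synthetic task (positives near +2.5, unlabeled mixing both classes) the ensemble separates the two regions after a few rounds; the point of the sketch is only that every quantity in the loop, including the combining weight alpha, is computed from P and U data alone.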