The need to learn from positive and unlabeled data, or PU learning, arises in many applications and has attracted increasing interest. While random forests are known to perform well on many tasks with positive and negative data, recent PU algorithms are generally based on deep neural networks, and the potential of tree-based PU learning remains under-explored. In this paper, we propose new random forest algorithms for PU learning. Key to our approach is a new interpretation of decision tree algorithms for positive and negative data as \emph{recursive greedy risk minimization algorithms}. We extend this perspective to the PU setting to develop new decision tree learning algorithms that directly minimize PU-data-based estimators of the expected risk. This allows us to develop an efficient PU random forest algorithm, PU extra trees. Our approach has three desirable properties: it is robust to the choice of loss function, in the sense that various loss functions lead to the same decision trees; it requires little hyperparameter tuning compared to neural-network-based PU learning; and it supports a feature importance measure that directly quantifies a feature's contribution to risk minimization. Our algorithms demonstrate strong performance on several datasets. Our code is available at \url{https://github.com/puetpaper/PUExtraTrees}.
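To make the risk-minimization view concrete, the sketch below shows the standard unbiased PU risk estimator (du Plessis et al., 2014), which an estimator-minimizing tree learner could use to score candidate splits. This is a minimal illustration under assumptions, not the paper's implementation: the function name \texttt{pu\_risk}, its signature, and the plug-in loss are hypothetical, and the class prior is assumed known.

\begin{verbatim}
import numpy as np

def pu_risk(g, X_p, X_u, prior, loss):
    """Unbiased PU estimate of the expected risk (illustrative sketch).

    g     : callable mapping a sample to a prediction in {-1, +1}
    X_p   : positive samples, shape (n_p, d)
    X_u   : unlabeled samples, shape (n_u, d)
    prior : class prior pi = P(y = +1), assumed known
    loss  : callable loss(prediction, label) -> float
    """
    # pi * E_P[l(g(x), +1)]: risk on positives labeled positive
    risk_p_pos = np.mean([loss(g(x), +1) for x in X_p])
    # pi * E_P[l(g(x), -1)]: correction term over positives
    risk_p_neg = np.mean([loss(g(x), -1) for x in X_p])
    # E_U[l(g(x), -1)]: risk treating all unlabeled data as negative
    risk_u_neg = np.mean([loss(g(x), -1) for x in X_u])
    return prior * risk_p_pos - prior * risk_p_neg + risk_u_neg
\end{verbatim}

A greedy tree learner in this style would, at each node, choose the split whose induced leaf predictions minimize such an estimate, recursing on the resulting child nodes.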