Classification with positive and unlabeled (PU) data frequently arises in bioinformatics, clinical data, and ecological studies, where collecting negative samples can be prohibitively expensive. While prior work on PU data focuses on binary classification, in this paper we consider multiple positive labels, a common and practically important setting. We introduce a multinomial-PU model and an ordinal-PU model, suited to unordered and ordered labels respectively. We propose proximal gradient descent-based algorithms to minimize the l_{1,2}-penalized log-likelihood losses, with guaranteed convergence to stationary points of the non-convex objective. Despite the challenging non-convexity induced by the presence-only data and multi-class labels, we prove statistical error bounds for the stationary points within a neighborhood of the true parameters in the high-dimensional regime. This is made possible by a careful characterization of the landscape of the log-likelihood loss in that neighborhood. In addition, simulations and two real-data experiments demonstrate the empirical benefits of our algorithms over baseline methods.
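To make the algorithmic idea concrete, the following is a minimal sketch of one proximal gradient iteration under an l_{1,2} (group-lasso-type) penalty. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the grouping or the exact PU log-likelihood, so the sketch assumes each row of the parameter matrix (one feature across all classes) forms a group and takes the gradient of the smooth loss as a user-supplied function. The names `prox_l12`, `proximal_gradient_step`, `grad`, `step`, and `lam` are hypothetical.

```python
import math


def prox_l12(theta, threshold):
    """Block soft-thresholding: prox of threshold * sum_j ||theta[j]||_2.

    Assumption: theta is a list of rows, and each row is one group.
    """
    out = []
    for row in theta:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows whose Euclidean norm is at most the threshold are zeroed out;
        # larger rows are shrunk toward zero by the threshold.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


def proximal_gradient_step(theta, grad, step, lam):
    """One iteration: gradient step on the smooth loss, then the l_{1,2} prox.

    grad is a hypothetical callable returning the gradient of the smooth
    (possibly non-convex) loss, with the same shape as theta.
    """
    g = grad(theta)
    moved = [[t - step * gv for t, gv in zip(t_row, g_row)]
             for t_row, g_row in zip(theta, g)]
    return prox_l12(moved, step * lam)
```

In practice the step would be repeated until the iterates stop changing. Because whole rows are set to zero at once, a penalty grouped this way selects or discards a feature jointly for all classes.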