High-dimensional learning problems, where the number of features exceeds the sample size, typically require sparse regularization for effective prediction and variable selection. While such techniques are well established for fully supervised data, they remain underexplored in weak-supervision settings such as Positive-Confidence (Pconf) classification. Pconf learning uses only positive samples equipped with confidence scores, thereby avoiding the need for negative data; however, existing Pconf methods are ill-suited to high-dimensional regimes. This paper proposes a novel sparse-penalization framework for high-dimensional Pconf classification. We introduce estimators based on the convex Lasso penalty and the non-convex SCAD and MCP penalties, the latter mitigating the Lasso's shrinkage bias and improving feature recovery. Theoretically, we establish estimation and prediction error bounds for the L1-regularized Pconf estimator, showing that it achieves near minimax-optimal sparse recovery rates under a restricted strong convexity (RSC) condition. To solve the resulting composite objective, we develop an efficient proximal gradient algorithm. Extensive simulations demonstrate that the proposed methods attain predictive performance and variable-selection accuracy comparable to fully supervised approaches, effectively bridging the gap between weak supervision and high-dimensional statistics.
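For concreteness, the estimator can be written as the composite objective below. This is a sketch assuming the standard Pconf empirical risk with the logistic loss and a linear score, which is one natural instantiation rather than necessarily the paper's exact specification.

```latex
% Standard Pconf empirical risk (logistic loss \ell, linear score w^T x)
% with an L1 penalty; r_i = p(y = +1 | x_i) is the confidence of sample x_i.
\hat{w} \in \arg\min_{w \in \mathbb{R}^d}\;
  \underbrace{\frac{1}{n} \sum_{i=1}^{n}
    \Bigl[\, \ell(w^\top x_i)
      + \frac{1 - r_i}{r_i}\, \ell(-w^\top x_i) \Bigr]}_{\text{smooth Pconf risk } \hat{R}(w)}
  \; + \; \lambda \lVert w \rVert_1
```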
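The sketch below shows a proximal gradient (ISTA) iteration for this composite objective, assuming the logistic-loss Pconf risk written above; the function names (pconf_risk_grad, prox_l1, pconf_lasso) and the fixed step size are illustrative choices, not the paper's implementation. The structural point is that the smooth Pconf risk supplies the gradient step, while the L1 penalty enters only through its proximal operator, soft-thresholding.

```python
import numpy as np

def sigmoid(z):
    # logistic function, clipped for numerical stability
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def logistic_loss(z):
    # numerically stable log(1 + exp(-z))
    return np.logaddexp(0.0, -z)

def pconf_objective(w, X, r, lam):
    # penalized empirical Pconf risk: smooth risk + L1 penalty
    z = X @ w
    risk = np.mean(logistic_loss(z) + ((1.0 - r) / r) * logistic_loss(-z))
    return risk + lam * np.sum(np.abs(w))

def pconf_risk_grad(w, X, r):
    # gradient of the smooth part;
    # d/dz log(1+e^{-z}) = -sigmoid(-z), d/dz log(1+e^{z}) = sigmoid(z)
    z = X @ w
    coef = -sigmoid(-z) + ((1.0 - r) / r) * sigmoid(z)
    return X.T @ coef / X.shape[0]

def prox_l1(v, t):
    # soft-thresholding: the proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pconf_lasso(X, r, lam, step=0.1, n_iter=1000):
    """ISTA for the L1-penalized Pconf risk.
    X: (n, d) positive samples; r: (n,) confidences in (0, 1]."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = prox_l1(w - step * pconf_risk_grad(w, X, r), step * lam)
    return w

# toy run: n = 60 positive samples, d = 200 features, 3 of them active
# (a crude toy: real Pconf data are drawn from the positive class p(x | y=+1))
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
w_true = np.zeros(200)
w_true[:3] = 2.0
r = np.clip(sigmoid(X @ w_true), 0.05, 1.0)  # model-based confidence scores
w_hat = pconf_lasso(X, r, lam=0.05)
print("objective:", pconf_objective(w_hat, X, r, 0.05))
print("selected features:", np.flatnonzero(w_hat))
```

Swapping prox_l1 for the SCAD or MCP proximal operators gives the non-convex variants; in practice a backtracking line search or acceleration (FISTA) would replace the fixed step size.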