Penalized logistic regression is extremely useful for binary classification with a large number of covariates (possibly exceeding the sample size), with several real-life applications including genomic disease classification. However, the existing methods based on the likelihood loss function are sensitive to data contamination and other noise, and hence robust methods are needed for stable and more accurate inference. In this paper, we propose a family of robust estimators for sparse logistic models utilizing the popular density power divergence based loss function and general adaptively weighted LASSO penalties. We study the local robustness of the proposed estimators through their influence functions and also derive their oracle properties and asymptotic distributions. With extensive empirical illustrations, we demonstrate the significantly improved performance of our proposed estimators over existing ones, with particular gains in robustness. Our proposal is finally applied to analyse four different real datasets for cancer classification, obtaining robust and accurate models that simultaneously perform gene selection and patient classification.
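To make the type of objective described above concrete, the following is a minimal sketch of a DPD-based penalized criterion for the logistic model, assuming the standard density power divergence loss for Bernoulli responses with tuning parameter $\alpha > 0$ and generic adaptive weights $w_j$; the exact formulation and notation used in the paper may differ.

$$
\min_{\beta}\;\frac{1}{n}\sum_{i=1}^{n}\Big\{\pi_i(\beta)^{1+\alpha}+\big(1-\pi_i(\beta)\big)^{1+\alpha}-\Big(1+\tfrac{1}{\alpha}\Big)\big[y_i\,\pi_i(\beta)^{\alpha}+(1-y_i)\big(1-\pi_i(\beta)\big)^{\alpha}\big]\Big\}+\lambda\sum_{j=1}^{p}w_j\,|\beta_j|,
$$

where $\pi_i(\beta)=\exp(x_i^\top\beta)/\{1+\exp(x_i^\top\beta)\}$ is the model probability of $y_i=1$. As $\alpha\to 0$, the loss term reduces (up to constants) to the negative log-likelihood, so this family contains the usual adaptively weighted LASSO for logistic regression as a limiting case, while $\alpha>0$ downweights the contribution of observations poorly fitted by the model.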