We propose a general approach to handle data contamination that might disrupt the performance of feature selection and estimation procedures for high-dimensional linear models. Specifically, we consider the co-occurrence of mean-shift and variance-inflation outliers, which can be modeled as additional fixed and random components, respectively, and evaluated independently. Our proposal performs feature selection while detecting and down-weighting variance-inflation outliers, detecting and excluding mean-shift outliers, and retaining non-outlying cases with full weight. Feature selection and mean-shift outlier detection are performed through a robust class of nonconcave penalization methods. Variance-inflation outlier detection is based on the penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination, while allowing the number of features to increase exponentially with the sample size, and detects truly outlying cases of each type with probability tending to one. This provides an optimal trade-off between a high breakdown point and efficiency. Computationally efficient heuristic procedures are also presented. We illustrate the finite-sample performance of our proposal through an extensive simulation study and a real-world application.
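To make the two contamination types concrete, the following is a minimal simulation sketch. It is not the procedure proposed in the paper. It only generates data from a sparse linear model in which some cases receive a fixed mean shift and others receive an inflated error variance. The function name, the parameter names, and all default values (sample size, shift magnitude, inflation factor) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def simulate_contaminated_linear_model(n=100, p=200, k=5, n_mso=5, n_vio=10,
                                       shift=10.0, inflation=25.0,
                                       sigma=1.0, seed=0):
    """Simulate y = X @ beta + phi + eps with two kinds of outliers.

    All names and defaults here are hypothetical choices for illustration.
    - Mean-shift outliers (MSO): a fixed offset phi_i != 0 is added.
    - Variance-inflation outliers (VIO): Var(eps_i) = inflation * sigma**2.
    """
    rng = np.random.default_rng(seed)

    # High-dimensional design (p > n) with a sparse coefficient vector.
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:k] = 2.0

    # Pick disjoint index sets for the two outlier types.
    idx = rng.permutation(n)
    mso = np.sort(idx[:n_mso])
    vio = np.sort(idx[n_mso:n_mso + n_vio])

    # Fixed component: mean shifts for the MSO cases only.
    phi = np.zeros(n)
    phi[mso] = shift

    # Random component: per-case variance multipliers (1 for clean cases).
    v = np.ones(n)
    v[vio] = inflation
    eps = sigma * np.sqrt(v) * rng.standard_normal(n)

    y = X @ beta + phi + eps
    return X, y, beta, mso, vio


X, y, beta, mso, vio = simulate_contaminated_linear_model()
```

Under this setup, a robust procedure of the kind described above would aim to recover the support of `beta`, exclude the cases in `mso`, and down-weight the cases in `vio`.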