Outliers occur widely in big-data applications and may severely affect statistical estimation and inference. In this paper, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms that are easy to implement and enjoy guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point, so that on regular datasets the number of data resamplings can be substantially reduced. Combining statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The resulting resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low and high dimensions. Experiments in regression, classification, and neural networks demonstrate excellent performance of the proposed methodology in the presence of gross outliers.
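To make the trimming connection concrete, the following is a minimal sketch (not the paper's actual algorithm) of estimation with explicit outlyingness parameters, assuming a squared loss in linear regression. Each sample i receives a parameter gamma_i; restricting gamma to at most k nonzeros and alternating between the two blocks of variables amounts to refitting after trimming the k most outlying samples. The function name `resistant_lstsq`, the budget `k`, and the alternating scheme are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def resistant_lstsq(X, y, k, n_iter=200, tol=1e-10):
    """Toy alternating scheme for
        min over (beta, gamma) of ||y - X @ beta - gamma||^2
        subject to gamma having at most k nonzero entries,
    where a nonzero gamma_i flags sample i as outlying."""
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(n)
    for _ in range(n_iter):
        # beta-step: ordinary least squares on the gamma-adjusted response
        beta_new = np.linalg.lstsq(X, y - gamma, rcond=None)[0]
        # gamma-step: absorb the k largest residuals (hard ranking
        # threshold) -- exactly what trimming those samples would do
        r = y - X @ beta_new
        gamma = np.zeros(n)
        top = np.argsort(np.abs(r))[-k:]
        gamma[top] = r[top]
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, gamma

# Usage sketch: plant gross outliers and fit with a matching budget.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = X @ np.ones(p) + 0.1 * rng.standard_normal(n)
y[:20] += 10.0  # gross outliers in the first 20 samples
beta_hat, gamma_hat = resistant_lstsq(X, y, k=20)
```

Each step minimizes the objective over one block of variables, so the objective decreases monotonically; because the k-sparsity constraint makes the problem nonconvex, the iterate converges to a fixed point that need not be a global (or even local) minimizer, consistent with the abstract's remark about the obtained resistant estimators.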