In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean $\theta$ of a square-integrable r.v. $Z$, around which accurate nonasymptotic confidence bounds can be built, even when $Z$ does not exhibit a sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to design training procedures that are not sensitive to atypical observations. More recently, a new line of work has sought to characterize and leverage MoM's ability to deal with corrupted data. In this context, the present work proposes a general study of MoM's concentration properties under the contamination regime, which provides a clear understanding of the impact of the outlier proportion and of the number of blocks chosen. The analysis is extended to (multisample) $U$-statistics, i.e. averages over tuples of observations, which raise additional challenges due to the induced dependence. Finally, we show that the latter bounds can be used in a straightforward fashion to derive generalization guarantees for pairwise learning in a contaminated setting, and propose an algorithm to compute provably reliable decision functions.
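For intuition, the MoM estimator itself is simple to compute: partition the sample into blocks, average each block, and take the median of the block means. A minimal NumPy sketch follows; the function name, the contiguous block-assignment scheme, and the test data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def median_of_means(z, n_blocks):
    """Median-of-Means estimate of the mean of the sample z.

    Splits z into n_blocks (roughly equal) blocks, averages each block,
    and returns the median of the block means. Blocks are taken
    contiguously here; a random partition may also be used.
    """
    z = np.asarray(z, dtype=float)
    blocks = np.array_split(z, n_blocks)  # illustrative: contiguous blocks
    return float(np.median([b.mean() for b in blocks]))
```

With a few gross outliers, at most that many blocks are contaminated, so the median of block means stays close to the true mean while the empirical mean is dragged away; this is the robustness property the bounds above quantify.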