Many modern datasets are collected automatically and are thus easily contaminated by outliers. This has led to renewed interest in robust estimation, including new notions of robustness such as robustness to adversarial contamination of the data. However, most robust estimation methods are designed for a specific model. Notably, many methods were proposed recently to obtain robust estimators in linear models (or generalized linear models), and a few were developed for very specific settings, for example beta regression or sample selection models. In this paper we develop a new approach for robust estimation in arbitrary regression models, based on Maximum Mean Discrepancy (MMD) minimization. We build two estimators, both of which are proven to be robust to Huber-type contamination. For one of them we obtain a non-asymptotic error bound and show that it is also robust to adversarial contamination, but this estimator is computationally more expensive to use in practice than the other one. As a byproduct of our theoretical analysis of the proposed estimators, we derive new results on kernel conditional mean embedding of distributions which are of independent interest.
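To give some intuition for why minimizing a Maximum Mean Discrepancy can yield robustness, the following is a minimal sketch and not a reproduction of either estimator built in the paper. Everything in it is an assumption made for illustration: a Gaussian kernel on the response with bandwidth `gamma`, a linear model y = theta * x with Gaussian noise of known standard deviation `sigma`, a criterion that averages the squared MMD between a point mass at each observed response and the model's conditional distribution at the same covariate, a plain grid search, and synthetic data in which 10% of the responses are replaced by outliers. The function names `mmd2_point_vs_gaussian` and `mmd_criterion` are hypothetical.

```python
import math
import random


def mmd2_point_vs_gaussian(y, mean, sigma, gamma):
    """Squared MMD between a point mass at y and N(mean, sigma^2),
    for the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 * gamma^2)).
    Both expectations below are closed-form Gaussian integrals."""
    s2 = gamma ** 2 + sigma ** 2
    # E k(y, Y) with Y ~ N(mean, sigma^2)
    cross = gamma / math.sqrt(s2) * math.exp(-(y - mean) ** 2 / (2 * s2))
    # E k(Y, Y') with Y, Y' independent draws from N(mean, sigma^2)
    model = gamma / math.sqrt(gamma ** 2 + 2 * sigma ** 2)
    # k(y, y) = 1 for this kernel
    return 1.0 - 2.0 * cross + model


def mmd_criterion(theta, xs, ys, sigma=0.5, gamma=1.0):
    """Average squared MMD over the sample for the assumed model
    y | x ~ N(theta * x, sigma^2)."""
    total = sum(mmd2_point_vs_gaussian(y, theta * x, sigma, gamma)
                for x, y in zip(xs, ys))
    return total / len(xs)


# Synthetic data (an assumption for illustration): true slope 2,
# with the first 40 of 400 responses replaced by points on y = -10 * x.
rng = random.Random(0)
n, n_outliers = 400, 40
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
for i in range(n_outliers):
    ys[i] = -10.0 * xs[i]

# Ordinary least squares through the origin, for comparison.
theta_ols = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Grid search over theta in [-5, 5] with step 0.01.
grid = [-5.0 + 0.01 * j for j in range(1001)]
theta_mmd = min(grid, key=lambda t: mmd_criterion(t, xs, ys))

print("least squares slope:", theta_ols)
print("MMD-based slope:    ", theta_mmd)
```

Under these assumptions, minimizing the criterion amounts to maximizing a sum of terms of the form exp(-r^2 / (2 * (gamma^2 + sigma^2))) in the residuals r, so an observation with a very large residual contributes almost nothing, whereas the squared loss lets it dominate. The bandwidth `gamma` controls this trade-off and is fixed arbitrarily here.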