The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with numerous applications in statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to achieve strong robust estimation guarantees under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that is more amenable to analysis than previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
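To make the mass-removal mechanism concrete, recall the Huber $\varepsilon$-contamination model, under which one observes $\tilde{\mu} = (1-\varepsilon)\mu + \varepsilon\alpha$ for a clean distribution $\mu$ and an arbitrary outlier distribution $\alpha$. One natural way to formalize the proposed distance (a sketch of the idea rather than the precise definition developed in the paper, with $\mu' \le \mu$ denoting a sub-measure of $\mu$) trims up to $\varepsilon$ mass from each contaminated marginal, renormalizes, and transports what remains:
$$
\mathsf{W}_p^\varepsilon(\mu,\nu) \;=\; \inf_{\substack{\mu' \le \mu,\ \nu' \le \nu \\ \mu'(\mathbb{R}^d) \,=\, \nu'(\mathbb{R}^d) \,=\, 1-\varepsilon}} \mathsf{W}_p\!\Big(\tfrac{\mu'}{1-\varepsilon},\, \tfrac{\nu'}{1-\varepsilon}\Big).
$$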