The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications in statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders their applicability in practice. Inspired by the Huber contamination model, we propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Our formulation amounts to a highly regular optimization problem that lends itself better to analysis than previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing characterization of optimal perturbations, regularity, duality, and statistical estimation and robustness results. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the benefits of our framework via applications to generative modeling with contaminated datasets.
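To make the mass-removal idea concrete, the following is a minimal toy sketch, not the paper's dual-based method. It assumes two equal-size samples on the real line, restricts removal to whole sample points (k of n points per sample, so $\varepsilon = k/n$), and renormalizes the remaining points to uniform weights. The function names `wasserstein_1d` and `trimmed_wasserstein_1d` are hypothetical, and the brute-force search is only feasible for very small samples.

```python
from itertools import combinations


def wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size empirical measures on the real line.

    In one dimension the optimal coupling matches sorted points in order.
    """
    assert len(xs) == len(ys) and len(xs) > 0
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def trimmed_wasserstein_1d(xs, ys, k, p=1):
    """Smallest W_p over all ways of deleting k points from each sample.

    Assumption of this sketch: only whole points are deleted and the
    survivors get uniform weights. Brute force, exponential in n.
    """
    n = len(xs)
    assert len(ys) == n and 0 <= k < n
    best = float("inf")
    for keep_x in combinations(range(n), n - k):
        sub_x = [xs[i] for i in keep_x]
        for keep_y in combinations(range(n), n - k):
            sub_y = [ys[j] for j in keep_y]
            best = min(best, wasserstein_1d(sub_x, sub_y, p))
    return best


clean = [0.0, 1.0, 2.0, 3.0]
contaminated = [0.0, 1.0, 2.0, 100.0]  # one gross outlier

plain = wasserstein_1d(clean, contaminated)                 # dominated by the outlier
robust = trimmed_wasserstein_1d(clean, contaminated, k=1)   # outlier can be dropped
```

In this toy case a single outlier at 100 drives the plain distance to 24.25, while allowing one point to be dropped from each sample brings the trimmed value to 0. The general definition permits fractional mass removal, so this whole-point restriction should be read as an illustrative simplification.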