We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning: given two sets of samples, determine whether they are drawn from the same distribution. In particular, we aim to circumvent the curse of dimensionality in the Wasserstein distance: in high dimensions its testing power diminishes, an inherent consequence of the slow concentration of Wasserstein metrics in high-dimensional spaces. A key contribution is to couple the distance with an optimal projection, finding the low-dimensional linear mapping that maximizes the Wasserstein distance between the projected probability distributions. We characterize the finite-sample convergence rate for integral probability metrics (IPMs) and present practical algorithms for computing the proposed metric. Numerical examples validate our theoretical results.
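To make the idea concrete, the following is a minimal sketch of a projected two-sample statistic, not the algorithm of the paper. It assumes a one-dimensional projection, equal sample sizes, and a crude random search over unit directions in place of a proper optimization over projections; the permutation calibration and the names `projected_w1` and `permutation_pvalue` are likewise illustrative assumptions.

```python
import numpy as np


def projected_w1(X, Y, n_dirs=200, seed=0):
    """Largest 1-D Wasserstein-1 distance over random unit directions.

    Assumption: random search is a stand-in for the optimal projection;
    X and Y must have the same number of rows.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # For equal-size 1-D samples, W1 is the mean absolute difference
    # between the sorted projections.
    PX = np.sort(X @ dirs.T, axis=0)
    PY = np.sort(Y @ dirs.T, axis=0)
    return float(np.max(np.mean(np.abs(PX - PY), axis=0)))


def permutation_pvalue(X, Y, n_perm=200, seed=0):
    """Permutation p-value for the statistic above (illustrative only)."""
    rng = np.random.default_rng(seed)
    observed = projected_w1(X, Y, seed=seed)
    Z = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        # Same seed, hence same directions, for every relabeling.
        stat = projected_w1(Z[idx[:n]], Z[idx[n:]], seed=seed)
        count += stat >= observed
    return (count + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(X, Y)` on two arrays of shape `(n, D)` returns a p-value; a small value suggests that the two samples come from different distributions. Random search becomes less effective as `D` grows, which is one reason a dedicated optimization over projections is preferable in practice.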