Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Consequently, its upper level sets (the depth-trimmed regions) give rise to a definition of multivariate quantiles. In this work, we propose two new pseudo-metrics between continuous probability measures based on data depth and its associated central regions. The first is constructed as the $L_p$-distance between the depth functions of the two distributions, while the second relies on the Hausdorff distance between their quantile regions. The approach can further be seen as an original way to extend the one-dimensional formulae of the Wasserstein distance, which involve quantiles and cdfs, to the multivariate setting. After discussing the properties of these pseudo-metrics and providing conditions under which they define a distance, we highlight similarities with the Wasserstein distance. Interestingly, the derived non-asymptotic bounds show that, in contrast to the Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse of dimensionality. Moreover, based on the support function of a convex body, we propose an efficient approximation whose time complexity is linear in both the size of the data set and its dimension. The quality of this approximation and the performance of the proposed approach are illustrated in experiments. Furthermore, by construction the regions-based pseudo-metric is robust w.r.t. both outliers and heavy tails, a behavior confirmed in the numerical experiments.
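To illustrate the role of the support function mentioned above, the following is a minimal sketch and not the approximation proposed in this work. It relies on the standard fact that the Hausdorff distance between two compact convex sets equals the supremum, over unit directions $u$, of the absolute difference of their support functions. As a simplifying assumption, the convex hulls of the two samples stand in for depth-trimmed regions, and the supremum is replaced by a maximum over randomly drawn directions, so the result is a lower bound on the exact value. The function name `support_function_hausdorff` and its parameters are hypothetical.

```python
import numpy as np


def support_function_hausdorff(X, Y, n_directions=500, seed=0):
    """Approximate the Hausdorff distance between conv(X) and conv(Y).

    X, Y: arrays of shape (n, d) and (m, d).
    Assumption: convex hulls of the samples are used as stand-ins for
    depth-trimmed regions; this is an illustrative simplification.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    # Draw random directions and normalize them onto the unit sphere.
    U = rng.standard_normal((n_directions, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)

    # Support function of a convex hull in direction u: max_i <x_i, u>.
    h_X = (X @ U.T).max(axis=0)
    h_Y = (Y @ U.T).max(axis=0)

    # Maximum over sampled directions (a lower bound on the supremum).
    return float(np.abs(h_X - h_Y).max())
```

For a fixed number of directions, the cost is dominated by the two matrix products, which scale linearly with the number of points and with the dimension. Replacing the maximum of the projections by an empirical quantile would be one way to reduce sensitivity to outliers, but that variant is not shown here.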