The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in machine learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distances between the depth-based quantile regions of the two distributions. We show that it behaves well w.r.t. major transformation groups and that it is able to factor out translations. Robustness, an appealing feature of this pseudo-metric, is studied through the finite-sample breakdown point. Moreover, we propose an efficient approximation method whose time complexity is linear in the size of the data set and in its dimension. The quality of this approximation, as well as the performance of the proposed approach, is illustrated in numerical experiments.
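To make the construction concrete, the following is a minimal sketch of one way to approximate an average Hausdorff distance between depth-trimmed regions. It is not the approximation method proposed in the paper, and it rests on several assumptions of ours: the depth is taken to be the halfspace (Tukey) depth, whose regions are convex; the support function of the region at level $\alpha$ in direction $u$ is replaced by the empirical $(1-\alpha)$-quantile of the data projected on $u$ (in general only an upper bound); and the supremum over directions is replaced by a maximum over randomly drawn unit vectors. The function name `depth_region_distance` and the parameters `n_directions`, `alphas`, and `seed` are hypothetical.

```python
import numpy as np


def depth_region_distance(X, Y, n_directions=200, alphas=None, seed=0):
    """Rough projection-based proxy for an average Hausdorff distance
    between halfspace-depth regions of two samples (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if alphas is None:
        # Assumed grid of depth levels; halfspace depth lies in [0, 1/2].
        alphas = np.linspace(0.05, 0.45, 9)

    # Draw random directions uniformly on the unit sphere.
    d = X.shape[1]
    U = rng.standard_normal((n_directions, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)

    # Project both samples on every direction: shape (n_samples, n_directions).
    PX = X @ U.T
    PY = Y @ U.T

    # Assumption: the (1 - alpha)-quantile of the projections stands in for
    # the support function of the alpha-region in each direction.
    levels = 1.0 - np.asarray(alphas)
    QX = np.quantile(PX, levels, axis=0)  # shape (n_levels, n_directions)
    QY = np.quantile(PY, levels, axis=0)

    # For convex compact sets, the Hausdorff distance is the largest gap
    # between support functions; here the maximum runs over sampled directions.
    per_level = np.abs(QX - QY).max(axis=1)

    # Average the per-level distances over the grid of depth levels.
    return float(per_level.mean())
```

Under these assumptions, comparing a sample with a translated copy of itself returns a value that is at most the norm of the shift and approaches it as `n_directions` grows. The cost is dominated by the two projection products, which scale with the number of points, the dimension, and the number of directions.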