We propose new tools for the geometric exploration of data objects taking values in a general separable metric space $(\Omega, d)$. Given a probability measure on $\Omega$, we introduce depth profiles, where the depth profile of an element $\omega\in\Omega$ is the distribution of the distances between $\omega$ and the other elements of $\Omega$ under that measure. Depth profiles can be harnessed to define transport ranks, which are based on optimal transport maps between depth profiles and capture the centrality of each element in $\Omega$ with respect to the entire data cloud. We study the properties of transport ranks and show that they provide an effective device for detecting and visualizing patterns in samples of random objects, and that they entail notions of transport medians, modes, level sets and quantiles for data in general separable metric spaces. We further study estimates of depth profiles and transport ranks based on samples of random objects and use empirical process theory to establish the convergence of these empirical estimates to their population targets. We demonstrate the usefulness of depth profiles and the associated transport ranks and visualizations for distributional data through a sample of age-at-death distributions for various countries, for compositional data through energy usage for U.S. states, and for network data through New York taxi trips.
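To make the construction concrete, the following is a minimal sketch of empirical depth profiles and a transport-rank-style score computed from a pairwise distance matrix. The abstract does not state the rank formula, so the sketch assumes one plausible form: the logistic (expit) transform of the average signed difference between quantile functions of depth profiles, which is what a one-dimensional optimal transport map between two profiles integrates to. The function names `depth_profile_quantiles` and `transport_ranks` and the grid size are illustrative choices, not taken from the source.

```python
import numpy as np


def depth_profile_quantiles(D, grid):
    """Empirical depth profiles, represented by their quantile functions.

    D is an (n, n) matrix of pairwise distances d(x_i, x_j).
    Row i of the result holds the quantiles, at the levels in `grid`,
    of the distances from x_i to the other n - 1 sample points.
    """
    n = D.shape[0]
    Q = np.empty((n, grid.size))
    for i in range(n):
        others = np.delete(D[i], i)  # drop the zero self-distance
        Q[i] = np.quantile(others, grid)
    return Q


def transport_ranks(D, n_grid=200):
    """Assumed rank: expit of the mean signed transport between profiles.

    For one-dimensional distributions the optimal transport map from
    profile i to profile j moves the u-quantile of i to the u-quantile
    of j, so the signed mass moved is the integral over u of
    Q_j(u) - Q_i(u). Central points have small distances to the rest
    of the sample, hence a positive average and a score above 1/2.
    """
    n = D.shape[0]
    # Midpoint grid on (0, 1); averaging over it approximates the integral.
    grid = (np.arange(n_grid) + 0.5) / n_grid
    Q = depth_profile_quantiles(D, grid)
    m = Q.mean(axis=1)  # approximates the integral of Q_i(u) over u
    # Average of (m_j - m_i) over j != i.
    signed = (m.sum() - n * m) / (n - 1)
    return 1.0 / (1.0 + np.exp(-signed))  # expit maps the score into (0, 1)


# Toy sample on the real line with one outlier; any metric would do,
# since only the distance matrix enters the computation.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
D = np.abs(x[:, None] - x[None, :])
ranks = transport_ranks(D)
print(np.round(ranks, 3))
```

Under this assumed definition, the sample point with the largest score plays the role of a transport median, and thresholding the scores gives nested level sets of the data cloud.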