Data depth, introduced by Tukey (1975), is an important tool in data science, robust statistics, and computational geometry. One chief barrier to its broader practical utility is that many common measures of depth are computationally intensive, requiring on the order of $n^d$ operations to exactly compute the depth of a single point within a data set of $n$ points in $d$-dimensional space. Often, however, we are not directly interested in the absolute depths of the points, but rather in their relative ordering. For example, we may want to find the most central point in a data set (a generalized median), or to identify and remove all outliers (points on the fringe of the data set with low depth). With this observation, we develop a novel, instance-adaptive algorithm for data depth computation by reducing the problem of exactly computing $n$ depths to an $n$-armed stochastic multi-armed bandit problem, which we can solve efficiently. We focus our exposition on simplicial depth, developed by Liu (1990), which has emerged as a promising notion of depth due to its interpretability and asymptotic properties. We provide general instance-dependent theoretical guarantees for our proposed algorithms, which readily extend to many other common measures of data depth, including majority depth, Oja depth, and likelihood depth. When specialized to the case where the gaps in the data follow a power-law distribution with parameter $\alpha<2$, we show that we can reduce the complexity of identifying the deepest point in the data set (the simplicial median) from $O(n^d)$ to $\tilde{O}(n^{d-(d-1)\alpha/2})$, where $\tilde{O}$ suppresses logarithmic factors. We corroborate our theoretical results with numerical experiments on synthetic data, showing the practical utility of our proposed methods.
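To make the bandit reduction concrete, the following is a minimal sketch in two dimensions, not the paper's actual algorithm. It assumes that each data point is an arm, that one "pull" draws a random triangle from the other data points and returns 1 if the triangle contains the point (so the arm's mean is its simplicial depth), and that a textbook successive-elimination rule with Hoeffding confidence intervals is used to find the deepest point. The function names `in_triangle`, `pull`, and `deepest_point`, and the parameters `delta`, `batch`, and `max_pulls`, are hypothetical choices made for illustration.

```python
import math
import random


def in_triangle(p, a, b, c):
    """Return True if the 2-D point p lies in the closed triangle abc."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # p is inside (or on the boundary) when the three signs do not disagree.
    return not (has_neg and has_pos)


def pull(points, i, rng):
    """One bandit pull for arm i: sample a random triangle of *other* points.

    Excluding point i from the triangle vertices is a convention assumed here.
    """
    n = len(points)
    # Sample 3 distinct indices from {0..n-1} \ {i} without building a list.
    idx = [j + (j >= i) for j in rng.sample(range(n - 1), 3)]
    a, b, c = (points[j] for j in idx)
    return 1.0 if in_triangle(points[i], a, b, c) else 0.0


def deepest_point(points, delta=0.05, batch=50, max_pulls=20000, seed=0):
    """Estimate the index of the simplicial median by successive elimination."""
    rng = random.Random(seed)
    n = len(points)
    active = list(range(n))
    sums = [0.0] * n
    pulls = 0  # pulls per surviving arm (all survivors share the same count)
    while len(active) > 1 and pulls < max_pulls:
        for i in active:
            for _ in range(batch):
                sums[i] += pull(points, i, rng)
        pulls += batch
        # Hoeffding radius with a union bound over arms and rounds.
        rad = math.sqrt(math.log(4 * n * pulls ** 2 / delta) / (2 * pulls))
        best_lcb = max(sums[i] / pulls - rad for i in active)
        # Drop every arm whose upper bound falls below the best lower bound.
        active = [i for i in active if sums[i] / pulls + rad >= best_lcb]
    return max(active, key=lambda i: sums[i] / pulls)


if __name__ == "__main__":
    # Eleven points on a circle plus the center, which is the deepest point.
    pts = [(math.cos(2 * math.pi * k / 11), math.sin(2 * math.pi * k / 11))
           for k in range(11)]
    pts.append((0.0, 0.0))
    print(deepest_point(pts))
```

In this toy example the points on the circle are vertices of the convex hull, so no triangle of other points contains them and they are eliminated after a few hundred pulls each, far fewer than the number of triangles an exact computation would enumerate on a large data set. Arms with depth close to the maximum would need more pulls, which is the instance-adaptive behavior the abstract describes.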