The Curse Revisited: When are Distances Informative for the Ground Truth in Noisy High-Dimensional Data?

Distances between data points are widely used in machine learning. Yet, when corrupted by noise, these distances -- and thus the models based upon them -- may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we exactly characterize such effects in noisy high-dimensional data using an asymptotic probabilistic expression. Furthermore, it has previously been argued that neighborhood queries become meaningless and unstable when distance concentration occurs, meaning that there is poor relative discrimination between the furthest and closest neighbors in the data. We conclude that this is not necessarily the case when we decompose the data into a ground truth -- which we aim to recover -- and a noise component. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be truthful even when distance concentration occurs. We include thorough empirical verification of our results, as well as interesting experiments in which our derived phase shift, where neighbors become random or not, turns out to be identical to the phase shift where common dimensionality reduction methods perform poorly or well at recovering low-dimensional reconstructions of high-dimensional data with dense noise.
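The accumulation effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: it assumes a two-dimensional ground truth drawn uniformly from the unit square, zero-padded into a higher-dimensional ambient space, and corrupted by dense i.i.d. Gaussian noise. The function names `nearest_neighbors` and `neighbor_agreement` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def nearest_neighbors(points):
    """Return, for each row of `points`, the index of its nearest other row."""
    sq_norms = np.sum(points ** 2, axis=1)
    # Squared Euclidean distances via the Gram matrix, avoiding an (n, n, d) array.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbor
    return np.argmin(sq_dists, axis=1)


def neighbor_agreement(n_points, ambient_dim, noise_std, signal_dim=2, seed=0):
    """Fraction of points whose noisy nearest neighbor equals the true one.

    Assumption (illustrative only): the ground truth occupies the first
    `signal_dim` coordinates, and the noise is Gaussian in every coordinate.
    """
    rng = np.random.default_rng(seed)
    truth = np.zeros((n_points, ambient_dim))
    truth[:, :signal_dim] = rng.uniform(size=(n_points, signal_dim))
    noisy = truth + noise_std * rng.standard_normal(size=(n_points, ambient_dim))
    return float(np.mean(nearest_neighbors(truth) == nearest_neighbors(noisy)))


if __name__ == "__main__":
    # Same per-coordinate noise level, increasing ambient dimension.
    for dim in (2, 20, 200, 2000):
        print(dim, neighbor_agreement(n_points=100, ambient_dim=dim, noise_std=0.01))
```

Holding the per-coordinate noise level fixed while increasing `ambient_dim` lets the noise contribution to each pairwise distance grow, so the fraction of truthful empirical nearest neighbors would be expected to drop. This is the qualitative behavior the abstract describes, not a reproduction of its asymptotic result.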
