Distances between data points are widely used in machine learning applications. Yet, when corrupted by noise, these distances -- and thus the models based upon them -- may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we exactly characterize such effects in noisy high-dimensional data using an asymptotic probabilistic expression. Previously, it has been argued that neighborhood queries become meaningless and unstable when distance concentration occurs, meaning that there is poor relative discrimination between the furthest and closest neighbors in the data. However, we conclude that this is not necessarily the case when we decompose the data into a ground truth component -- which we aim to recover -- and a noise component. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be truthful even when distance concentration occurs. We also include a thorough empirical verification of our results, as well as experiments in which the `phase shift' we derive -- the point at which empirical neighbors become random or remain truthful -- turns out to coincide with the phase shift at which common dimensionality reduction methods perform poorly or well at recovering low-dimensional reconstructions of high-dimensional data with dense noise.
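The distance concentration phenomenon referred to above can be illustrated with a small numerical sketch (not taken from the paper; sample sizes and the uniform data model are illustrative assumptions): as the dimensionality grows, the relative contrast between a query point's furthest and nearest neighbor shrinks toward zero.

```python
# Illustrative sketch of distance concentration: with i.i.d. uniform data,
# the relative contrast (Dmax - Dmin) / Dmin between the furthest and
# nearest neighbor of a random query point shrinks as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=1000):
    """Relative contrast of Euclidean distances from a random query
    to n points drawn uniformly from the d-dimensional unit cube."""
    points = rng.uniform(size=(n, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:>4}: relative contrast = {relative_contrast(d):.3f}")
```

Running this shows the contrast dropping by orders of magnitude from d=2 to d=1000, i.e. all points become almost equidistant from the query; the paper's point is that this alone does not imply empirical neighbors are untruthful with respect to the ground truth.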