Distances between data points are widely used in point cloud representation learning. Yet, it is no secret that under the effect of noise, these distances, and thus the models based upon them, may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we characterize such effects in high-dimensional data using an asymptotic probabilistic expression. Furthermore, while it has previously been argued that neighborhood queries become meaningless and unstable when the relative discrimination between the furthest and closest point is poor, we conclude that this is not necessarily the case when the ground truth data are explicitly separated from the noise. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be true even when we observe this discrimination to be poor. We include thorough empirical verification of our results, as well as experiments showing that our derived phase shift, at which neighbors change from being random to not being random, is identical to the phase shift at which common dimensionality reduction methods change from performing poorly to performing well at finding low-dimensional representations of high-dimensional data with dense noise.
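The accumulation of small marginal noise effects described above can be illustrated with a minimal simulation. This is only a sketch under assumptions that are not taken from the paper: the ground truth is drawn uniformly from a 2-dimensional unit square and zero-padded to the ambient dimension, the noise is i.i.d. Gaussian in every coordinate, and the function names `nearest_neighbors` and `neighbor_agreement`, the sample size, and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np


def nearest_neighbors(points):
    """Return, for each row, the index of its nearest other row (Euclidean)."""
    sq_norms = np.sum(points ** 2, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbor
    return np.argmin(sq_dists, axis=1)


def neighbor_agreement(n_points, signal_dim, ambient_dim, noise_std, rng):
    """Fraction of points whose noisy nearest neighbor equals the true one.

    Assumed setup (not from the paper): low-dimensional uniform signal,
    zero-padded to ambient_dim, plus dense Gaussian noise in all coordinates.
    """
    truth = np.zeros((n_points, ambient_dim))
    truth[:, :signal_dim] = rng.uniform(0.0, 1.0, size=(n_points, signal_dim))
    noisy = truth + rng.normal(0.0, noise_std, size=truth.shape)
    return np.mean(nearest_neighbors(truth) == nearest_neighbors(noisy))


rng = np.random.default_rng(0)
# Same per-coordinate noise level, increasing number of noisy dimensions.
agreement = {
    dim: neighbor_agreement(200, 2, dim, 0.02, rng)
    for dim in (2, 20, 200, 2000)
}
for dim, frac in agreement.items():
    print(f"ambient dimension {dim:5d}: agreement with ground truth {frac:.2f}")
```

With the per-coordinate noise level held fixed, the fraction of empirical nearest neighbors that match the ground truth should drop as the number of noisy dimensions grows. The sketch only illustrates the qualitative effect; it does not reproduce the asymptotic expression or the conditions derived in the paper.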