Modern datasets are characterized by a large number of features that may conceal complex dependency structures. Dimensionality reduction techniques are essential for handling this type of data. Many of these methods rely on the concept of intrinsic dimension, a measure of the complexity of the dataset. In this article, we first review the TWO-NN model, a likelihood-based intrinsic dimension estimator recently introduced in the literature. The TWO-NN estimator is based on the statistical properties of the ratio of the distances between a point and its first two nearest neighbors, assuming that the points are a realization of a homogeneous Poisson point process. We extend the TWO-NN theoretical framework by providing novel distributional results for consecutive and generic ratios of distances. These results are then used to derive intrinsic dimension estimators, called Cride and Gride. The new estimators are more robust to noisy measurements than the TWO-NN estimator and allow one to study how the intrinsic dimension evolves as a function of the scale at which the dataset is analyzed. We discuss the properties of the different estimators with the help of simulation scenarios.
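
To make the TWO-NN idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard TWO-NN result that, under the homogeneous Poisson assumption, the ratio mu = r2 / r1 of the second to the first nearest-neighbor distance follows a Pareto distribution with scale 1 and shape d. Treating the ratios as independent, the maximum-likelihood estimate is then d = N / sum(log mu_i). The function name `two_nn_id`, the brute-force distance computation, and the synthetic dataset are illustrative choices that do not come from the article.

```python
import numpy as np


def two_nn_id(points):
    """Sketch of a maximum-likelihood TWO-NN intrinsic dimension estimate.

    Assumes no duplicate points (otherwise r1 = 0 and the ratio is undefined)
    and treats the ratios r2 / r1 as independent Pareto(1, d) draws.
    """
    x = np.asarray(points, dtype=float)
    n = x.shape[0]
    # Brute-force pairwise Euclidean distances: O(n^2) memory, small n only.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # After partitioning at index 1, columns 0 and 1 hold the two smallest
    # distances in each row, in increasing order.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    mu = r2 / r1
    # MLE of the Pareto shape parameter, i.e. the intrinsic dimension.
    return n / np.log(mu).sum()


# Hypothetical test data: a 2-dimensional uniform cloud embedded
# isometrically in 5 ambient dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1000, 2))
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # 5x2, orthonormal columns
embedded = latent @ basis.T

demo_estimate = two_nn_id(embedded)
print(f"estimated intrinsic dimension: {demo_estimate:.2f}")
```

Because the embedding preserves distances, the estimate should land near the latent dimension of 2 rather than the ambient dimension of 5. The sketch uses only the first two neighbors of each point, so it does not reproduce the Cride and Gride estimators, which rely on ratios involving neighbors of higher order.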