Accurate estimation of intrinsic dimensionality (ID) is crucial in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search, and subspace clustering. However, existing ID estimators generally require sample sizes (that is, neighborhood sizes) on the order of hundreds of points to converge, so they may be of only limited use in applications where the data consists of many small natural groups. In this paper, we propose a local ID estimation strategy that remains stable even for "tight" localities consisting of as few as 20 sample points. The estimator applies maximum likelihood estimation (MLE) techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that the proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
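For context, the kind of MLE computation the abstract refers to can be illustrated with the widely used neighbor-distance form of the LID estimator, which uses only the distances from a query point to its k nearest neighbors. The sketch below is an illustration under that assumption and is not the pairwise-distance estimator proposed in the paper; the function name `lid_mle` and its input conventions are hypothetical.

```python
import numpy as np


def lid_mle(distances):
    """Neighbor-distance MLE of local intrinsic dimensionality.

    `distances` holds the distances from one query point to its k
    nearest neighbors. This is the standard single-query estimator,
    NOT the pairwise-distance variant described in the abstract.
    """
    r = np.sort(np.asarray(distances, dtype=float))
    r = r[r > 0]  # zero distances (exact duplicates) have no finite logarithm
    if r.size < 2:
        raise ValueError("need at least two positive distances")
    w = r[-1]  # radius of the neighborhood (largest distance)
    mean_log_ratio = np.mean(np.log(r / w))
    if mean_log_ratio == 0.0:
        raise ValueError("all distances are equal; estimate is undefined")
    # ID estimate = -( (1/k) * sum_i ln(r_i / w) )^(-1)
    return -1.0 / mean_log_ratio
```

With only k = 20 distances, estimates of this form vary widely from one sample to the next, which is the small-sample instability that motivates using all pairwise distances within the neighborhood instead.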