The curse of dimensionality is a phenomenon frequently observed in machine learning (ML) and knowledge discovery (KD). There is a large body of literature investigating its origin and impact, using methods from mathematics as well as from computer science. Among the mathematical insights into data dimensionality, there is an intimate link between the curse of dimensionality and the phenomenon of measure concentration, which makes the former accessible to methods of geometric analysis. The present work provides a comprehensive study of the intrinsic geometry of a data set, based on Gromov's metric measure geometry and Pestov's axiomatic approach to intrinsic dimension. In detail, we define the concept of a geometric data set and introduce a metric as well as a partial order on the set of isomorphism classes of such data sets. Based on these objects, we propose and investigate an axiomatic approach to the intrinsic dimension of geometric data sets and establish a concrete dimension function with the desired properties. Our model for data sets and their intrinsic dimension is computationally feasible and, moreover, adaptable to specific ML/KD algorithms, as illustrated by various experiments.