The manifold hypothesis, which assumes that data lie on or close to an unknown manifold of low intrinsic dimensionality, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibit distinct non-manifold structures, which result in singularities that can lead to erroneous conclusions about the data. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this detection problem by developing (i) persistent local homology, a new topology-driven framework for quantifying the intrinsic dimension of a data set locally, and (ii) Euclidicity, a topology-based multi-scale measure for assessing the 'manifoldness' of individual points. We show that our approach can reliably identify singularities of complex spaces, while also capturing singular structures in real-world data sets.
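To make the notion of a singularity concrete, the following minimal sketch estimates a local dimension at several scales for a toy space made of two line segments crossing at the origin. It is not the persistent local homology or Euclidicity computation described above: it substitutes a much simpler local PCA estimate, and the function names, the variance threshold, and the radii are all assumptions chosen for illustration.

```python
import numpy as np


def local_pca_dimension(points, center, radius, variance_threshold=0.95):
    """Estimate the local dimension around `center` with local PCA.

    NOTE: a simplified stand-in for illustration, not the topology-based
    method of the abstract. Returns the number of principal directions
    needed to explain `variance_threshold` of the variance of all points
    within `radius` of `center`.
    """
    distances = np.linalg.norm(points - center, axis=1)
    neighbours = points[distances <= radius]
    if len(neighbours) < 3:
        return 0  # too few points for a meaningful estimate

    centred = neighbours - neighbours.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values**2
    total = variances.sum()
    if total == 0:
        return 0  # all neighbours coincide

    explained = np.cumsum(variances) / total
    # Index of the first cumulative ratio reaching the threshold, plus one.
    return int(np.searchsorted(explained, variance_threshold) + 1)


def dimension_profile(points, center, radii):
    """Evaluate the local estimate at several scales (hypothetical helper)."""
    return [local_pca_dimension(points, center, r) for r in radii]


# Toy data: two unit-speed line segments crossing at the origin.
t = np.linspace(-1.0, 1.0, 201)
horizontal = np.column_stack([t, np.zeros_like(t)])
vertical = np.column_stack([np.zeros_like(t), t])
points = np.vstack([horizontal, vertical])

radii = [0.1, 0.2, 0.4]
crossing_profile = dimension_profile(points, np.array([0.0, 0.0]), radii)
regular_profile = dimension_profile(points, np.array([0.5, 0.0]), radii)
print("crossing point:", crossing_profile)
print("regular point: ", regular_profile)
```

In this toy example a point in the interior of one segment sees a one-dimensional neighbourhood at every radius, whereas the crossing point sees two independent directions even though the space is one-dimensional everywhere else. That mismatch is the kind of non-manifold behaviour a singularity detector is meant to flag. A PCA-based estimate of this sort only captures linear structure, which is one reason to prefer a topological, multi-scale measure.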