Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework that exploits the metric structure of a data set. Our approach rests on the manifold assumption, i.e., that the observed, nominally high-dimensional data lie on a much lower-dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high-dimensional data. We also suggest a novel, mathematically precise, and widely applicable distinction between distributional and structural outliers, based on the geometry and topology of the data manifold, that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general, and we include applications to image and graph data. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods, and quantified using standard outlier scoring methods applied to the manifold embedding vectors.
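The pipeline described in the abstract (embed the data with a manifold learning method, then score the embedding vectors with a standard outlier scoring method) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic curves, the choice of classical MDS on pairwise L2 distances as the embedding method, the mean k-nearest-neighbor distance as the outlier score, and all function names (`pairwise_l2`, `classical_mds`, `knn_distance_score`).

```python
import numpy as np

def pairwise_l2(X):
    """Euclidean distances between all rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off

def classical_mds(D, n_components=2):
    """Embed points from a distance matrix D via classical (Torgerson) MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def knn_distance_score(Z, k=5):
    """Mean distance to the k nearest neighbors; larger means more outlying."""
    D = pairwise_l2(Z)
    np.fill_diagonal(D, np.inf)              # exclude each point's self-distance
    return np.sort(D, axis=1)[:, :k].mean(axis=1)

# Hypothetical functional data: 50 noisy, phase-shifted sine curves observed
# on a common grid, plus one curve of a different shape (a straight line).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
phases = rng.uniform(-0.2 * np.pi, 0.2 * np.pi, size=50)
inliers = np.sin(2.0 * np.pi * t[None, :] + phases[:, None])
inliers += rng.normal(scale=0.05, size=inliers.shape)
outlier = 3.0 * t
curves = np.vstack([inliers, outlier[None, :]])  # the outlier is the last row

# Step 1: embed the curves. Step 2: score the embedding vectors.
embedding = classical_mds(pairwise_l2(curves), n_components=2)
scores = knn_distance_score(embedding, k=5)
most_outlying = int(np.argmax(scores))
print("most outlying curve index:", most_outlying)
```

The sketch keeps to NumPy so that it stays self-contained. In practice the two steps could be swapped for library implementations, for example `sklearn.manifold.Isomap` for the embedding and `sklearn.neighbors.LocalOutlierFactor` for the scoring, without changing the overall structure.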