Persistent homology is an important methodology from topological data analysis that adapts theory from algebraic topology to data settings and has been applied successfully in many domains. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology becomes computationally prohibitive when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying multiple smaller subsamples from the large dataset. We show that the mean of the persistence diagrams of the subsamples, taken as a mean persistence measure computed from the subsamples, is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and the size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties of the space of persistence diagrams, together with random set theory, to obtain our theoretical results in the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application to shape clustering on complex large-scale point cloud data.
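The subsampling scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: it handles only degree-0 homology (connected components) of the Vietoris-Rips filtration, where every class is born at scale 0 and the death scales are the edge lengths of a Euclidean minimum spanning tree; the filtration scale is taken to be the pairwise distance; and the function names `h0_deaths` and `mean_persistence_measure` are hypothetical. The mean persistence measure is represented as a list of weighted diagram points, each carrying weight 1/B for B subsamples.

```python
import math
import random


def h0_deaths(points):
    """Death scales of the finite H0 classes of a Vietoris-Rips filtration.

    These equal the edge lengths of a Euclidean minimum spanning tree,
    computed here with Prim's algorithm in O(n^2) time.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from each point to the tree
    best[0] = 0.0
    deaths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:
            deaths.append(best[u])  # one component merges at this scale
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)


def mean_persistence_measure(data, n_subsamples, subsample_size, seed=0):
    """Average the H0 diagrams of random subsamples as a weighted point measure.

    Returns a list of (birth, death, weight) triples; each diagram point
    from each subsample carries weight 1 / n_subsamples.
    """
    rng = random.Random(seed)
    weight = 1.0 / n_subsamples
    measure = []
    for _ in range(n_subsamples):
        sub = rng.sample(data, subsample_size)  # draw without replacement
        measure.extend((0.0, d, weight) for d in h0_deaths(sub))
    return measure


if __name__ == "__main__":
    # Synthetic example: points on the unit circle (an assumption for illustration).
    rng = random.Random(1)
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(500)]
    data = [(math.cos(t), math.sin(t)) for t in angles]
    measure = mean_persistence_measure(data, n_subsamples=10, subsample_size=30)
    print(len(measure), sum(w for _, _, w in measure))
```

Keeping the result as a weighted measure, rather than forcing it back into a single diagram, reflects the abstract's description of the mean as a mean persistence measure. Higher homology degrees would require a dedicated persistent homology library, which this sketch deliberately avoids.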