Fixed and adaptive landmark sets for finite pseudometric spaces

Topological data analysis (TDA) is an expanding field that leverages principles and tools from algebraic topology to quantify structural features of data sets or to transform them into more manageable forms. As its theoretical foundations have developed, TDA has shown promise in extracting useful information from high-dimensional, noisy, and complex data such as those used in biomedicine. To operate efficiently, these techniques may employ landmark samplers, either random or heuristic. The heuristic maxmin procedure obtains a roughly even distribution of sample points by implicitly constructing a cover comprising sets of uniform radius. However, issues arise with data that vary in density or include points with multiplicities, as are common in biomedicine. We propose an analogous procedure, "lastfirst," based on ranked distances, which implies a cover comprising sets of uniform cardinality. We first rigorously define the procedure and prove that it obtains landmarks with desired properties. We then perform benchmark tests and compare its performance to that of maxmin on feature detection and class prediction tasks involving simulated and real-world biomedical data. Lastfirst is more general than maxmin in that it can be applied to any data on which arbitrary (and not necessarily symmetric) pairwise distances can be computed. Lastfirst is more computationally costly, but our implementation scales at the same rate as maxmin. We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks. Where the numerical values of similarity measures are not meaningful, as in many biomedical contexts, lastfirst sampling may also improve interpretability.
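
To make the contrast between the two samplers concrete, the following is a minimal Python sketch, not the implementation described here. The `maxmin` function follows the standard greedy maxmin rule (repeatedly add the point whose minimum distance to the current landmarks is largest). The `rank_maxmin` function is only a simplified, hypothetical stand-in for lastfirst: it replaces each distance from a landmark by the number of points strictly closer to that landmark, and it ignores the tie-handling that a rigorous definition requires. All function names, the `seed` parameter, and the toy data are assumptions made for illustration.

```python
from bisect import bisect_left


def greedy_landmarks(n, score_from, k, seed=0):
    """Greedy selection: add the point whose smallest score from any landmark is largest.

    score_from(l) returns a list of n scores measured from landmark index l.
    """
    landmarks = [seed]
    best = list(score_from(seed))  # best[i] = smallest score from any landmark to i
    while len(landmarks) < min(k, n):
        chosen = set(landmarks)
        nxt = max((i for i in range(n) if i not in chosen), key=lambda i: best[i])
        if best[nxt] == 0:
            break  # every remaining point coincides with some landmark
        landmarks.append(nxt)
        best = [min(b, s) for b, s in zip(best, score_from(nxt))]
    return landmarks


def maxmin(points, dist, k, seed=0):
    """Standard maxmin: scores are raw distances from the landmark."""
    def distances(l):
        return [dist(points[l], p) for p in points]
    return greedy_landmarks(len(points), distances, k, seed)


def rank_maxmin(points, dist, k, seed=0):
    """Simplified rank-based analogue (an assumption, not the lastfirst definition).

    The score of x from landmark l is the number of points strictly closer to l than x.
    Only the ordering of distances from l matters, so dist need not be symmetric.
    """
    def ranks(l):
        d = [dist(points[l], p) for p in points]
        ordered = sorted(d)
        return [bisect_left(ordered, x) for x in d]
    return greedy_landmarks(len(points), ranks, k, seed)


# Toy data: a point with multiplicity 3 at 0, then 1, 2, and an outlier at 10.
points = [0.0, 0.0, 0.0, 1.0, 2.0, 10.0]
absdiff = lambda a, b: abs(a - b)

print(maxmin(points, absdiff, 3))       # [0, 5, 4]
print(rank_maxmin(points, absdiff, 3))  # [0, 5, 3]
```

On this toy set, both samplers pick the outlier second. They then differ: the distance-based rule takes the point at 2.0, while the rank-based rule takes the point at 1.0, because the three coincident points at 0 count toward how "far" a point is in rank from the outlier.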
