Topological data analysis (TDA) is an expanding field that uses principles and tools from algebraic topology to quantify structural features of data sets or to transform them into more manageable forms. As its theoretical foundations have developed, TDA has shown promise in extracting useful information from high-dimensional, noisy, and complex data such as those used in biomedicine. To improve efficiency, these techniques may employ landmark samplers. The heuristic maxmin procedure obtains a roughly even distribution of sample points by implicitly constructing a cover comprising sets of uniform radius. However, issues arise with data that vary in density or include points with multiplicities, both of which are common in biomedicine. We propose an analogous procedure, "lastfirst," based on ranked distances, which implicitly constructs a cover comprising sets of uniform cardinality. We first rigorously define the procedure and prove that it obtains landmarks with desired properties. We then perform benchmark tests and compare its performance to that of maxmin on feature detection and class prediction tasks involving simulated and real-world biomedical data. Lastfirst is more general than maxmin in that it can be applied to any data on which arbitrary (and not necessarily symmetric) pairwise distances can be computed. Lastfirst is more computationally costly, but our implementation scales at the same rate as maxmin. We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks. Where the numerical values of similarity measures are not meaningful, as in many biomedical contexts, lastfirst sampling may also improve interpretability.
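To make the contrast between distance-based and rank-based landmark selection concrete, here is a minimal Python sketch. The function `maxmin_landmarks` follows the standard maxmin heuristic: it repeatedly picks the point farthest from the landmarks chosen so far. The function `rank_maxmin_landmarks` is only a simplified rank-based analog written for illustration and should not be read as the lastfirst procedure defined in the paper. The function names, the choice of Euclidean distance, the seed index, the rank convention (the number of points strictly closer to a landmark), and the tie handling are all assumptions.

```python
from bisect import bisect_left
from math import dist  # Euclidean distance, Python 3.8+


def maxmin_landmarks(points, k, seed=0):
    """Pick up to k landmark indices with the maxmin heuristic."""
    landmarks = [seed]
    # min_d[i] = distance from point i to its nearest landmark so far
    min_d = [dist(p, points[seed]) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        if min_d[nxt] == 0:
            break  # every remaining point coincides with a landmark
        landmarks.append(nxt)
        min_d = [min(m, dist(p, points[nxt])) for m, p in zip(min_d, points)]
    return landmarks


def ranks_from(points, src):
    """Rank of each point as seen from points[src].

    Assumed convention: rank = number of points strictly closer to the
    source, so tied points (including duplicates) share a rank.
    """
    d = [dist(points[src], p) for p in points]
    ordered = sorted(d)
    return [bisect_left(ordered, x) for x in d]


def rank_maxmin_landmarks(points, k, seed=0):
    """Illustrative rank-based analog of maxmin (NOT the paper's definition).

    Distances are replaced by ranks, so only the ordering of distances
    from each landmark matters, not their numerical values.
    """
    landmarks = [seed]
    # min_r[i] = smallest rank of point i as seen from any landmark so far
    min_r = ranks_from(points, seed)
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=lambda i: min_r[i])
        if min_r[nxt] == 0:
            break  # every remaining point is tied with a landmark
        landmarks.append(nxt)
        min_r = [min(a, b) for a, b in zip(min_r, ranks_from(points, nxt))]
    return landmarks


# Toy 1-D data with a point of multiplicity three at the origin.
pts = [(0.0,), (0.0,), (0.0,), (1.0,), (2.0,), (10.0,)]
print(maxmin_landmarks(pts, 3))
print(rank_maxmin_landmarks(pts, 3))
```

On this toy input the two samplers agree on the first two landmarks but choose different third landmarks, because the repeated point at the origin changes the ranks without changing any distance. The example is meant only to show how multiplicities enter a rank-based criterion, not to reproduce the paper's results.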