Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
翻译:在数据库中适当地反映要素,以便查询能够准确匹配,这是信息检索的一项核心任务;最近,通过使用各种计量标准,将数据库的图形结构以等级保留方式嵌入成一个多层,从而实现了这一点。持久性同质学是地形数据分析中常用的一种工具,它能够严格地从层次和连接结构两方面对数据库进行定性。在各种嵌入的数据集中计算持久性同质学表明,一些常用的嵌入器未能保护连接。我们通过引入两种变相比较措施,将数据库的图形结构成功地保留在持续同质学中,从而捕捉到这一效果:特别是,它们解决了多元体的参数扭曲问题。我们提供了一种计算方法的算法,其计算方法大大缩短了现有方法的时间复杂性。我们利用这些措施来进行基于地形的信息检索,并展示其相对于持久性同性同质学标准瓶颈距离的更高性能。我们展示了我们在不同数据种类数据库中采用的方法,包括文本、视频和医疗图像。