Persistence diagrams have been widely used to quantify the underlying features of filtered topological spaces in data visualization. In many applications, computing distances between diagrams is essential; however, the computational cost of these distances makes this challenging. In this paper, we propose a persistence diagram hashing framework that learns a binary code representation of persistence diagrams, which allows for fast computation of distances. The framework is built upon a generative adversarial network (GAN) with a diagram distance loss function that steers the learning process. Instead of transforming diagrams into vectorized representations, we hash diagrams into binary codes, which have natural advantages in large-scale tasks. Training is domain-oblivious in that the model can be trained purely on synthetic, randomly created diagrams. As a consequence, our method is directly applicable to various datasets without retraining the model. These binary codes, when compared using the fast Hamming distance, preserve topological similarity between datasets better than other vectorized representations. To evaluate the method, we apply our framework to the problem of diagram clustering and compare the quality and performance of our approach to the state of the art. In addition, we show the scalability of our approach on a dataset with 10k persistence diagrams, which is not possible with current techniques. Moreover, our experimental results demonstrate that our method is significantly faster and uses less memory, while retaining comparable or better quality.
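To illustrate why comparing binary codes is cheap, the following minimal sketch computes all-pairs Hamming distances with XOR and a per-byte popcount table. It covers only the comparison step, not the learned GAN-based hashing itself; the function names (`pack_codes`, `hamming_matrix`) and the three 8-bit codes are hypothetical and chosen purely for illustration.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)


def pack_codes(bits):
    """Pack an (n, k) array of 0/1 values into (n, ceil(k/8)) uint8 rows."""
    bits = np.asarray(bits, dtype=np.uint8)
    # Rows are zero-padded to a whole number of bytes; the padding is the
    # same for every row, so it does not affect pairwise distances.
    return np.packbits(bits, axis=1)


def hamming_matrix(packed):
    """All-pairs Hamming distances between packed binary codes."""
    # XOR marks the differing bits; the table counts them byte by byte.
    xor = packed[:, None, :] ^ packed[None, :, :]
    return POPCOUNT[xor].sum(axis=2)


# Hypothetical 8-bit codes standing in for three hashed diagrams.
codes = np.array([
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
])

D = hamming_matrix(pack_codes(codes))
print(D)
```

In this toy input the first two codes differ in a single bit, while the third differs from the first in every bit, so a clustering step operating on `D` would group the first two together. The all-pairs matrix uses memory quadratic in the number of codes, so a large collection would need to be processed in blocks.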