Persistence diagrams are widely used in data visualization to quantify the underlying features of filtered topological spaces. Many applications need to compute distances between diagrams, but doing so is computationally expensive. In this paper, we propose a persistence diagram hashing framework that learns a binary code representation of persistence diagrams, which allows distances to be computed quickly. The framework is built on a generative adversarial network (GAN) with a diagram distance loss function that steers the learning process. Instead of using standard representations, we hash diagrams into binary codes, which have natural advantages in large-scale tasks. Training is domain-oblivious: the model can be trained purely on synthetic, randomly generated diagrams. As a consequence, the proposed method applies directly to various datasets without retraining the model. When compared using the fast Hamming distance, these binary codes preserve topological similarity between datasets better than other vectorized representations. To evaluate the method, we apply the framework to the problem of diagram clustering and compare the quality and performance of our approach with the state of the art. In addition, we show the scalability of our approach on a dataset of 10k persistence diagrams, which is not possible with current techniques. Our experimental results further demonstrate that the method is significantly faster and can potentially use less memory, while producing comparisons of comparable or better quality.
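To illustrate why binary codes make distance computation cheap, the following is a minimal sketch of pairwise Hamming distances over bit-packed codes. It is not the authors' implementation: the function names `pack_codes` and `hamming_matrix` are hypothetical, and the example codes are made up by hand rather than produced by the GAN hashing model.

```python
import numpy as np


def pack_codes(bits):
    """Pack an (n, k) array of 0/1 values into bytes, one row per diagram.

    np.packbits pads each row with zero bits up to a multiple of 8;
    the padding is identical for every row, so it never adds distance.
    """
    bits = np.asarray(bits, dtype=np.uint8)
    return np.packbits(bits, axis=1)


def hamming_matrix(packed):
    """Return the (n, n) matrix of pairwise Hamming distances."""
    # XOR every pair of rows: a set bit marks a position where two codes differ.
    xor = packed[:, None, :] ^ packed[None, :, :]
    # Count the set bits per pair.
    return np.unpackbits(xor, axis=2).sum(axis=2)


if __name__ == "__main__":
    # Three hand-written 10-bit codes standing in for hashed diagrams (assumed data).
    codes = [
        [0, 1, 1, 0, 1, 0, 0, 1, 1, 1],
        [0, 1, 0, 0, 1, 0, 1, 1, 1, 0],
        [1, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    ]
    print(hamming_matrix(pack_codes(codes)))
```

Each comparison reduces to an XOR and a bit count over a few bytes, and a packed code of k bits occupies only about k/8 bytes. The resulting distance matrix could then be passed to any clustering routine that accepts precomputed distances.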