Audio fingerprinting systems must identify query snippets in a large database efficiently and robustly. To this end, state-of-the-art systems use deep learning to generate compact audio fingerprints. To expedite the search, these systems rely on indexing methods that quantize fingerprints into hash codes in an unsupervised manner. However, these methods generate imbalanced hash codes, which leads to suboptimal performance. We therefore propose a self-supervised learning framework that computes fingerprints and balanced hash codes end to end, achieving both fast and accurate retrieval. We model hash code generation as a balanced clustering process, which we treat as an instance of the optimal transport problem. Experimental results indicate that, compared to competing methods, the proposed approach improves retrieval efficiency while preserving high accuracy, particularly at high distortion levels. Moreover, our system is efficient and scalable in terms of computational load and memory storage.
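To illustrate how balanced clustering can be cast as an optimal transport problem, the following is a minimal sketch and not the authors' implementation. The abstract does not name a solver; the sketch assumes entropy-regularized optimal transport solved with Sinkhorn-Knopp iterations, which is a common choice for this formulation. The function name `sinkhorn_balanced_assign`, the parameters `epsilon` and `n_iters`, and the synthetic scores are all hypothetical.

```python
import numpy as np


def sinkhorn_balanced_assign(scores, epsilon=0.05, n_iters=100):
    """Soft-assign N fingerprints to K hash buckets with near-equal bucket mass.

    scores: (N, K) array of similarities between fingerprints and buckets.
    Returns an (N, K) array whose rows sum to 1.
    Assumption: entropy-regularized OT with uniform marginals (Sinkhorn-Knopp).
    """
    n, k = scores.shape
    # Gibbs kernel; subtracting the global max keeps the exponentials bounded.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    row_target = np.full(n, 1.0 / n)  # every fingerprint carries equal mass
    col_target = np.full(k, 1.0 / k)  # every bucket should receive equal mass
    for _ in range(n_iters):
        q *= (col_target / q.sum(axis=0))[None, :]  # rescale columns
        q *= (row_target / q.sum(axis=1))[:, None]  # rescale rows
    # Renormalize each row into a probability distribution over buckets.
    return q / q.sum(axis=1, keepdims=True)


# Synthetic example: bucket 0 is artificially favored, so a plain argmax
# over the raw scores produces a heavily imbalanced hash table.
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 8))
scores[:, 0] += 2.0

naive_counts = np.bincount(scores.argmax(axis=1), minlength=8)
balanced = sinkhorn_balanced_assign(scores)
balanced_counts = np.bincount(balanced.argmax(axis=1), minlength=8)
print("naive bucket sizes:   ", naive_counts)
print("balanced bucket sizes:", balanced_counts)
```

In a learning setup of the kind the abstract describes, such balanced assignments would typically serve as targets for the hash-code head, so that buckets end up with similar occupancy and lookups touch a comparable number of candidates. How the proposed framework actually uses them is not stated in the abstract.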