Most existing audio fingerprinting systems have limitations when used for high-specific audio retrieval at scale. In this work, we generate a low-dimensional representation from a short unit segment of audio, and couple this fingerprint with a fast maximum inner-product search. To this end, we present a contrastive learning framework that derives from the segment-level search objective. Each update in training uses a batch consisting of a set of pseudo labels, randomly selected original samples, and their augmented replicas. These replicas simulate the degrading effects on original audio signals by applying small time offsets and various types of distortions, such as background noise and room/microphone impulse responses. In the segment-level search task, where conventional audio fingerprinting systems have typically failed, our system has shown promising results while using 10x smaller storage. Our code and dataset are available at \url{https://mimbres.github.io/neural-audio-fp/}.
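The retrieval step described above can be illustrated with a minimal sketch: segment fingerprints are L2-normalized so that maximum inner-product search ranks database entries by cosine similarity, and a degraded replica of a segment should still retrieve its clean original. All names, dimensions, and the noise model below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-12):
    # Unit-normalize so that inner product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Hypothetical database: 1000 segment fingerprints, 128-D (sizes are illustrative).
db = l2_normalize(rng.normal(size=(1000, 128)))

def search(query, db, top_k=3):
    """Exact maximum inner-product search: rank all fingerprints by dot product."""
    scores = db @ l2_normalize(query)
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

# Simulate a degraded replica of segment 42 with additive noise (a stand-in for
# the paper's time offsets, background noise, and impulse-response distortions).
query = db[42] + 0.1 * rng.normal(size=128)
ids, scores = search(query, db)
print(ids[0])  # the clean segment should rank first
```

In practice, an exact scan like this is replaced by an approximate nearest-neighbor index for speed at scale; the ranking criterion (inner product over unit-norm fingerprints) is the same.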