An ideal audio retrieval system efficiently and robustly identifies a short query snippet within a large database. However, the performance of well-known audio fingerprinting systems falls short at high levels of signal distortion. This paper presents an audio retrieval system that generates noise- and reverberation-robust audio fingerprints using a contrastive learning framework. Using these fingerprints, the method performs a comprehensive search to identify the query audio and precisely estimate its timestamp in the reference audio. Our framework trains a CNN to maximize the similarity between pairs of embeddings extracted from clean audio and from its corresponding distorted and time-shifted version. We employ a channel-wise spectral-temporal attention mechanism that gives more weight to the salient spectral-temporal patches in the signal, which helps discriminate between audio segments. Experimental results indicate that our system is efficient in computation and memory usage, is more accurate than competing state-of-the-art systems, particularly at higher distortion levels, and scales to a larger database.
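The training objective described above, pulling together the embeddings of a clean segment and its distorted, time-shifted counterpart, can be illustrated with a generic contrastive loss. The sketch below is an InfoNCE-style loss in plain Python and is an assumption for illustration only: the abstract does not specify the exact loss, and the function names (`l2_normalize`, `info_nce_loss`), the `temperature` value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(clean, distorted, temperature=0.1):
    """Mean InfoNCE-style loss over a batch of embedding pairs.

    clean[i] and distorted[i] form the positive pair; every other
    distorted[j] in the batch serves as a negative for clean[i].
    The temperature is a hypothetical default, not a value from the paper.
    """
    anchors = [l2_normalize(v) for v in clean]
    candidates = [l2_normalize(v) for v in distorted]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Temperature-scaled cosine similarity to every candidate.
        logits = [
            sum(a * c for a, c in zip(anchor, cand)) / temperature
            for cand in candidates
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(anchors)


# Toy embeddings (made up): two clean segments and two candidate batches.
clean = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # distorted versions stay close to their clean source
swapped = [[0.1, 0.9], [0.9, 0.1]]   # distorted versions resemble the wrong source

loss_aligned = info_nce_loss(clean, aligned)
loss_swapped = info_nce_loss(clean, swapped)
print(loss_aligned < loss_swapped)
```

Minimizing a loss of this kind rewards embeddings that remain close under distortion while staying distinguishable from other segments in the batch, which is the property a fingerprint needs for robust lookup.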