Deep learning-based methods have become a standard paradigm for cover song identification (CSI) in recent years, and the ByteCover systems have achieved state-of-the-art results on all mainstream CSI datasets. However, with the rapid growth of short videos, many real-world applications need to match short music excerpts against full-length tracks in a database, a task that remains under-explored and still lacks an industrial-level solution. In this paper, we upgrade the previous ByteCover systems to ByteCover3, which uses local features to further improve identification performance on short music queries. ByteCover3 is designed with a local alignment loss (LAL) module and a two-stage feature retrieval pipeline, allowing the system to perform CSI more precisely and efficiently. We evaluated ByteCover3 on multiple datasets under different benchmark settings, and it outperformed all compared methods, including its previous versions.