In music information retrieval, similarity-based retrieval and auto-tagging are essential components. Because human supervision signals are limited and do not scale, it is crucial for models to learn from alternative sources to improve performance. Self-supervised learning, which relies exclusively on learning signals derived from the music audio itself, has proven effective for auto-tagging. In this study, we build on the self-supervised learning approach to address similarity-based retrieval by introducing metric learning with a self-supervised auxiliary loss. Furthermore, diverging from conventional self-supervised learning methodologies, we find it beneficial to train the model jointly with both self-supervision and supervision signals rather than freezing the pre-trained model. We also find that refraining from data augmentation during the fine-tuning phase yields better results. Our experimental results confirm that the proposed method improves retrieval and tagging performance metrics in two distinct scenarios: one where human-annotated tags are available for all music tracks, and another where such tags are available for only a subset of tracks.
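The joint objective described above can be sketched as a supervised metric-learning loss plus a weighted self-supervised auxiliary term. This is a minimal, hypothetical illustration only: the paper does not specify these exact losses, and all names here (`triplet_loss`, `ntxent_loss`, the weight `alpha`) are illustrative assumptions, with a triplet margin loss standing in for the metric-learning component and a simplified NT-Xent contrastive loss standing in for the self-supervised signal.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on L2 distances between embedding batches."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def ntxent_loss(z1, z2, temperature=0.1):
    """Simplified NT-Xent contrastive loss between two views of the audio.

    Row i of z1 and row i of z2 are treated as the positive pair;
    all other rows in z2 act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=-1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=-1, keepdims=True)
    logits = z1 @ z2.T / temperature             # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()            # cross-entropy on matching pairs

def joint_loss(anchor, positive, negative, view1, view2, alpha=0.5):
    """Supervised metric loss plus a weighted self-supervised auxiliary loss."""
    return triplet_loss(anchor, positive, negative) + alpha * ntxent_loss(view1, view2)
```

Because both terms are optimized together, gradients from the supervised signal and the self-supervised signal flow through the same encoder, consistent with the abstract's finding that joint training outperforms freezing the pre-trained model.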