Music genre classification has been widely studied in recent years for its many applications in music information retrieval. Previous works tend to perform unsatisfactorily because they either rely on audio content alone or combine audio and lyrics content inefficiently. In addition, since genres normally co-occur in a music track, it is desirable to capture and model genre correlations to improve the performance of multi-label music genre classification. To address these issues, we present a novel multi-modal method that leverages an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of multi-label classification, a genre-correlation extraction module is presented to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset.
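The paper's exact loss formulation is not given in the abstract; as a rough illustration of what an audio-lyrics contrastive alignment objective typically looks like, the sketch below implements a standard CLIP-style symmetric InfoNCE loss in numpy. All names (`audio_lyrics_contrastive_loss`, the `temperature` value, the assumption that matched audio/lyrics pairs share a batch index) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project each embedding onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss, a sketch — not the paper's code.

    Assumes audio_emb[i] and lyrics_emb[i] come from the same track, so
    matched pairs lie on the diagonal of the similarity matrix.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(lyrics_emb, dtype=float))
    logits = a @ t.T / temperature          # (B, B) cosine similarities
    labels = np.arange(len(a))              # positives on the diagonal

    def xent(lg):
        # numerically stable log-softmax cross-entropy along rows
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the audio->lyrics and lyrics->audio directions (symmetric)
    return 0.5 * (xent(logits) + xent(logits.T))
```

In this kind of objective, each track's audio embedding is pulled toward its own lyrics embedding and pushed away from the other lyrics in the batch, which encourages the two modalities to share an aligned feature space before fusion.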