Music emotion recognition (MER) research is constrained by the scarcity of high-quality annotated datasets and by cross-track feature drift. This work presents two contributions to address these issues. First, Memo2496, a large-scale dataset, provides 2496 instrumental music tracks with continuous valence-arousal labels annotated by 30 certified music specialists; annotation quality is ensured through calibration with extreme-emotion exemplars and a consistency threshold of 0.25, measured as Euclidean distance in the valence-arousal space. Second, the Dual-view Adaptive Music Emotion Recogniser (DAMER) integrates three synergistic modules: Dual Stream Attention Fusion (DSAF), which enables token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross-attention; Progressive Confidence Labelling (PCL), which generates reliable pseudo-labels through curriculum-based temperature scheduling and consistency quantification with Jensen-Shannon divergence; and Style Anchored Memory Learning (SAML), which maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate state-of-the-art performance, with arousal-dimension accuracy improved by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
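To make the DSAF idea concrete, the following is a minimal PyTorch sketch of bidirectional token-level cross-attention between the two views. The class name, embedding dimension, head count, and mean-pool fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    """Hypothetical sketch of DSAF-style bidirectional cross-attention.

    Each view (Mel-spectrogram tokens, cochleagram tokens) attends to the
    other, so emotion-relevant cues in one representation can reweight the
    other at token level.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.mel_queries_coch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.coch_queries_mel = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mel_tokens: torch.Tensor, coch_tokens: torch.Tensor) -> torch.Tensor:
        # Inputs are (batch, tokens, dim); each stream uses the other view as key/value.
        mel_enhanced, _ = self.mel_queries_coch(mel_tokens, coch_tokens, coch_tokens)
        coch_enhanced, _ = self.coch_queries_mel(coch_tokens, mel_tokens, mel_tokens)
        # Pool over tokens and concatenate the two enhanced views (assumed fusion step).
        return torch.cat([mel_enhanced.mean(dim=1), coch_enhanced.mean(dim=1)], dim=-1)
```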
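Similarly, a hedged sketch of PCL's consistency check: pseudo-labels are accepted only when temperature-sharpened predictions from the two views agree under Jensen-Shannon divergence, with the temperature annealed over training in curriculum fashion. The threshold, schedule, and function names below are placeholders for illustration, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Jensen-Shannon divergence between two temperature-scaled predictions."""
    p = F.softmax(p_logits / temperature, dim=-1)
    q = F.softmax(q_logits / temperature, dim=-1)
    m = 0.5 * (p + q)
    # F.kl_div takes log-probabilities first and computes KL(target || input).
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def accept_pseudo_label(p_logits, q_logits, epoch, total_epochs,
                        t_start=2.0, t_end=0.5, threshold=0.05):
    """Curriculum-style acceptance: anneal the temperature, keep consistent samples.

    All hyperparameters here are hypothetical placeholders.
    """
    t = t_start + (t_end - t_start) * epoch / max(1, total_epochs - 1)
    return js_divergence(p_logits, q_logits, t).item() < threshold
```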