Symbolic music emotion recognition (SMER) is a key task in symbolic music understanding. Recent approaches have shown promising results by fine-tuning large-scale pre-trained models (e.g., MIDIBERT, a benchmark model for symbolic music understanding) to map musical semantics to emotional labels. While these models effectively capture distributional musical semantics, they often overlook tonal structures, particularly musical modes, which, according to music psychology, play a critical role in emotional perception. In this paper, we investigate the representational capacity of MIDIBERT and identify its limitations in capturing mode-emotion associations. To address this issue, we propose a Mode-Guided Enhancement (MoGE) strategy that incorporates psychological insights on mode into the model. Specifically, we first conduct a mode augmentation analysis, which reveals that MIDIBERT fails to effectively encode mode-emotion correlations. We then identify the least emotion-relevant layer within MIDIBERT and introduce a Mode-guided Feature-wise linear modulation Injection (MoFi) framework that injects explicit mode features into that layer, thereby enhancing the model's capacity for emotional representation and inference. Extensive experiments on the EMOPIA and VGMIDI datasets demonstrate that our mode injection strategy significantly improves SMER performance, achieving accuracies of 75.2% and 59.1%, respectively. These results validate the effectiveness of mode-guided modeling in symbolic music emotion recognition.
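To make the mode augmentation analysis concrete, here is a minimal sketch of one plausible augmentation: flipping a major-mode excerpt to its parallel minor by lowering scale degrees 3, 6, and 7 by a semitone. The function name `major_to_minor`, the flat pitch-list representation, and the fixed tonic are illustrative assumptions, not the paper's actual pipeline.

```python
def major_to_minor(pitches, tonic=0):
    """Hypothetical mode-flip augmentation: lower scale degrees 3, 6,
    and 7 (a major third, sixth, and seventh above the tonic) by one
    semitone, turning a major-mode excerpt into its parallel (natural)
    minor. `pitches` is a list of MIDI pitch numbers; `tonic` is the
    pitch class of the tonic (0 = C)."""
    flatten = {4, 9, 11}  # pitch classes of degrees 3, 6, 7 relative to the tonic
    out = []
    for p in pitches:
        pc = (p - tonic) % 12
        out.append(p - 1 if pc in flatten else p)
    return out


# Example: a C-major arpeggio plus B becomes its C-minor counterpart.
print(major_to_minor([60, 64, 67, 71]))  # [60, 63, 67, 70]
```

Comparing a model's emotion predictions on original and mode-flipped pieces is one simple way to probe whether it has encoded any mode-emotion correlation at all.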
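The MoFi injection builds on feature-wise linear modulation (FiLM), in which a conditioning vector produces per-channel scale and shift terms that are applied to a layer's activations. Below is a minimal PyTorch sketch of such a module, assuming mode information is available as a fixed-size feature vector; the class name `ModeFiLM`, the dimensions, and the choice of mode encoding are hypothetical stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ModeFiLM(nn.Module):
    """Hypothetical FiLM-style injection: a mode feature vector yields
    per-channel scale (gamma) and shift (beta) terms that modulate the
    hidden states of a chosen transformer layer."""

    def __init__(self, mode_dim: int, hidden_dim: int):
        super().__init__()
        # One linear head each for the scale and shift parameters.
        self.to_gamma = nn.Linear(mode_dim, hidden_dim)
        self.to_beta = nn.Linear(mode_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, mode_feat: torch.Tensor) -> torch.Tensor:
        # hidden:    (batch, seq_len, hidden_dim) activations of the
        #            least emotion-relevant layer.
        # mode_feat: (batch, mode_dim) explicit mode encoding, e.g. a
        #            major/minor embedding or key-profile features.
        gamma = self.to_gamma(mode_feat).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.to_beta(mode_feat).unsqueeze(1)    # (batch, 1, hidden_dim)
        # FiLM: feature-wise affine transformation, broadcast over time.
        return gamma * hidden + beta


# Example usage with toy dimensions (hidden size 768 is assumed, not given).
film = ModeFiLM(mode_dim=24, hidden_dim=768)
hidden = torch.randn(2, 512, 768)   # transformer hidden states
mode_feat = torch.randn(2, 24)      # explicit mode features
out = film(hidden, mode_feat)
print(out.shape)  # torch.Size([2, 512, 768])
```

Placing the modulation at the least emotion-relevant layer is a natural design choice under this reading: it conditions the model where emotional information is weakest while leaving the more emotion-sensitive layers of the pre-trained backbone untouched.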