Most previous work in music emotion recognition assumes a single, or at most a few, emotion labels for an entire song. Although it is known that the intensity of different emotions can vary within a song, annotated data for this finer-grained setup is scarce and difficult to obtain. In this work, we propose a method to predict emotion dynamics in song lyrics without song-level supervision. We frame each song as a time series and employ a State Space Model (SSM), combining a sentence-level emotion predictor with an Expectation-Maximization (EM) procedure to estimate the full emotion dynamics. Our experiments show that our method consistently improves over sentence-level baselines without requiring any annotated songs, making it well suited to scenarios with limited training data. Further analysis through case studies illustrates the benefits of our method while also revealing its limitations and pointing to future directions.
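The abstract does not specify the exact form of the SSM or the EM updates, but a minimal sketch of the general idea follows, assuming a 1-D linear-Gaussian state space model (a random walk over a latent emotion trajectory) whose noisy observations are the per-sentence scores of an off-the-shelf emotion predictor; EM alternates Kalman/RTS smoothing (E-step) with closed-form variance updates (M-step). All function and variable names (`kalman_smooth`, `em_fit`, `sentence_scores`) are illustrative, not the authors' implementation.

```python
import numpy as np

def kalman_smooth(y, q, r, x0=0.0, p0=1.0):
    """Forward Kalman filter + backward RTS smoother for the model
    x_t = x_{t-1} + N(0, q),  y_t = x_t + N(0, r)."""
    T = len(y)
    xf = np.zeros(T); pf = np.zeros(T)   # filtered mean / variance
    xp = np.zeros(T); pp = np.zeros(T)   # predicted mean / variance
    x, p = x0, p0
    for t in range(T):
        xp[t], pp[t] = x, p + q          # predict
        k = pp[t] / (pp[t] + r)          # Kalman gain
        xf[t] = xp[t] + k * (y[t] - xp[t])
        pf[t] = (1 - k) * pp[t]
        x, p = xf[t], pf[t]
    xs = xf.copy(); ps = pf.copy()       # smoothed mean / variance
    cs = np.zeros(T)                     # lag-one smoothed covariance
    for t in range(T - 2, -1, -1):
        g = pf[t] / pp[t + 1]            # smoother gain
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
        ps[t] = pf[t] + g**2 * (ps[t + 1] - pp[t + 1])
        cs[t + 1] = g * ps[t + 1]        # Cov(x_{t+1}, x_t | all y)
    return xs, ps, cs

def em_fit(y, q=0.1, r=0.1, iters=50):
    """EM: E-step = smoothing, M-step = closed-form noise variances."""
    for _ in range(iters):
        xs, ps, cs = kalman_smooth(y, q, r)
        # M-step for observation noise: E[(y_t - x_t)^2]
        r = np.mean((y - xs) ** 2 + ps)
        # M-step for process noise: E[(x_t - x_{t-1})^2]
        dx = xs[1:] - xs[:-1]
        q = np.mean(dx**2 + ps[1:] + ps[:-1] - 2 * cs[1:])
    return q, r

# Usage: hypothetical per-sentence valence scores from a sentence-level
# predictor; the smoothed trajectory is the estimated emotion dynamics.
sentence_scores = np.array([0.2, 0.3, 0.1, 0.8, 0.7, 0.9, 0.4, 0.3])
q, r = em_fit(sentence_scores)
trajectory, _, _ = kalman_smooth(sentence_scores, q, r)
print("q=%.3f r=%.3f" % (q, r), np.round(trajectory, 2))
```

The design choice this illustrates is the one the abstract names: the sentence-level predictor supplies noisy point estimates, while the SSM and EM pool evidence across neighboring sentences to recover a smooth trajectory, with no song-level labels entering the procedure at any point.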