Despite recent advances in audio content-based music emotion recognition, it remains an open question whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze how effectively several sets of features predict the arousal and valence of six different performances (by six famous pianists) of Bach's Well-Tempered Clavier, Book 1. These features include low-level acoustic features, score-based features, features extracted using a pre-trained emotion model, and Mid-level perceptual features. We compare their predictive power in several experiments designed to test performance-wise or piece-wise variation of emotion. We find that Mid-level features contribute significantly to predicting performance-wise variation of both arousal and valence, even better than the pre-trained emotion model. Our findings add to the evidence that Mid-level perceptual features are an important representation of musical attributes for several tasks; specifically, in this case, for capturing the expressive aspects of music that manifest as the perceived emotion of a musical performance.