Automatic speech emotion recognition (SER) by a computer is a critical component of more natural human-machine interaction. As in human-human interaction, the ability to perceive emotion correctly is essential for deciding how to act in a given situation. One open issue in SER is whether it is necessary to combine acoustic features with other data such as facial expressions, text, and motion capture. This research proposes to combine acoustic and text information by applying a late-fusion approach consisting of two steps. First, acoustic and text features are trained separately in deep learning systems. Second, the prediction results from these deep learning systems are fed into a support vector machine (SVM) to predict the final regression score. Furthermore, the task in this research is dimensional emotion modeling, because it enables a deeper analysis of affective states. Experimental results show that this two-stage late-fusion approach achieves higher performance than any one-stage processing, with a linear correlation between one-stage and two-stage results. The late-fusion approach also improves on previous early-fusion results as measured by the concordance correlation coefficient (CCC).
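To make the two-step scheme concrete, the following is a minimal sketch under stated assumptions: two small scikit-learn MLP regressors stand in for the separately trained acoustic and text deep learning systems, their per-utterance predictions are stacked, and a support vector regressor (the regression form of the SVM) produces the final dimensional score. The feature dimensions, synthetic data, and model settings are illustrative assumptions, not the paper's configuration; a CCC helper matching the metric named above is included for evaluation.

```python
# Sketch of two-stage late fusion for dimensional SER.
# Step 1: train acoustic and text models separately.
# Step 2: feed their predictions into an SVM regressor (SVR).
# All data, shapes, and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the deep nets
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features: acoustic (e.g., spectral/prosodic) and text
# (e.g., word-embedding) vectors, with a continuous label such as valence.
X_acoustic = rng.normal(size=(500, 40))
X_text = rng.normal(size=(500, 300))
y = rng.uniform(-1, 1, size=500)

# One shared split so first-stage predictions align per utterance.
idx_train, idx_test = train_test_split(np.arange(len(y)), random_state=0)

# Step 1: train each modality separately.
acoustic_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
text_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
acoustic_net.fit(X_acoustic[idx_train], y[idx_train])
text_net.fit(X_text[idx_train], y[idx_train])

# Step 2: stack the two first-stage predictions and fit an SVR on them.
def stage1_preds(idx):
    return np.column_stack([
        acoustic_net.predict(X_acoustic[idx]),
        text_net.predict(X_text[idx]),
    ])

fusion_svr = SVR(kernel="rbf")
fusion_svr.fit(stage1_preds(idx_train), y[idx_train])
y_pred = fusion_svr.predict(stage1_preds(idx_test))

def ccc(x, y):
    # Lin's concordance correlation coefficient:
    # 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

print(f"CCC on held-out set: {ccc(y[idx_test], y_pred):.3f}")
```

The SVR here consumes only the two first-stage scores, which is what distinguishes this late-fusion setup from early fusion, where raw acoustic and text features would be concatenated before a single model is trained.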