Speech emotion recognition is crucial for human-computer interaction. The temporal regions that carry different emotions are scattered across local parts of an utterance. Moreover, the temporal scales of important information may vary over a large range both within and across speech segments. Although transformer-based models have made progress in this field, existing models cannot precisely locate important regions at different temporal scales. To address this issue, we propose the Dynamic Window transFormer (DWFormer), a new architecture that leverages temporal importance by dynamically splitting samples into windows. A self-attention mechanism is applied within each window to capture temporally important information locally in a fine-grained way. Cross-window information interaction is also incorporated for global communication. DWFormer is evaluated on both the IEMOCAP and the MELD datasets. Experimental results show that the proposed model outperforms previous state-of-the-art methods.
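To make the windowing idea concrete, below is a minimal sketch of window-based self-attention with cross-window communication. It assumes fixed-size, non-overlapping windows and mean-pooled window summaries for simplicity; the actual DWFormer splits windows dynamically according to temporal importance, and all identifiers here (`WindowedSelfAttention`, `window_size`, etc.) are hypothetical rather than the authors' implementation.

```python
# A minimal, illustrative sketch of windowed self-attention with
# cross-window interaction. Fixed window sizes are an assumption made
# for clarity; DWFormer itself determines windows dynamically.
import torch
import torch.nn as nn


class WindowedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        # Local attention: operates independently within each window.
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global attention: lets window summaries communicate.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); time is assumed divisible by window_size.
        b, t, d = x.shape
        w = self.window_size
        n = t // w

        # 1) Split the sequence into non-overlapping windows and apply
        #    self-attention within each window (fine-grained, local).
        windows = x.reshape(b * n, w, d)
        local, _ = self.local_attn(windows, windows, windows)

        # 2) Summarize each window (mean pooling here) and attend across
        #    the window summaries (global, cross-window communication).
        summary = local.reshape(b, n, w, d).mean(dim=2)        # (b, n, d)
        global_ctx, _ = self.global_attn(summary, summary, summary)

        # 3) Broadcast the global context back into every window.
        out = local.reshape(b, n, w, d) + global_ctx.unsqueeze(2)
        return out.reshape(b, t, d)


if __name__ == "__main__":
    x = torch.randn(2, 96, 64)  # (batch, frames, feature dim)
    block = WindowedSelfAttention(dim=64, num_heads=4, window_size=16)
    print(block(x).shape)  # torch.Size([2, 96, 64])
```

Separating the local and global attention steps keeps the cost of attention quadratic only in the window length rather than the full sequence length, while the summary-level attention preserves a path for information to flow between distant regions.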