The spectrogram is commonly used as the input feature of deep neural networks to learn higher-level time-frequency patterns of speech signals for speech emotion recognition (SER). \textcolor{black}{Generally, different emotions correspond to specific energy activations within both frequency bands and time frames of the spectrogram, which indicates that the frequency and time domains are both essential for representing emotion in SER. However, recent spectrogram-based works mainly focus on modeling long-term dependencies in the time domain, which leaves two issues: (1) they neglect to model emotion-related correlations within the frequency domain during time-frequency joint learning; (2) they fail to capture the specific frequency bands associated with emotions.} To address these issues, we propose an attentive time-frequency neural network (ATFNN) for SER, consisting of a time-frequency neural network (TFNN) and time-frequency attention. Specifically, to address the first issue, we design a TFNN with a frequency-domain encoder (F-Encoder) based on the Transformer encoder and a time-domain encoder (T-Encoder) based on bidirectional long short-term memory (Bi-LSTM). The F-Encoder and T-Encoder model the correlations within frequency bands and time frames, respectively, and they are embedded into a time-frequency joint learning strategy to obtain time-frequency patterns for speech emotions. Moreover, to handle the second issue, we adopt time-frequency attention with a frequency-attention network (F-Attention) and a time-attention network (T-Attention) to focus on the emotion-related frequency band ranges and time frame ranges, which enhances the discriminability of speech emotion features.
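
As a rough illustration of the time-frequency attention idea, the sketch below reweights a toy spectrogram by softmax weights over frequency bands and over time frames. It is not the ATFNN implementation: the function names are hypothetical, and the attention scores are plain mean energies standing in (as an assumption) for the outputs of the learned F-Attention and T-Attention networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_frequency_attention(spec):
    """Reweight spec[f][t] by attention over frequency bands and time frames.

    Assumption: mean energy per band / per frame is used as the attention
    score. A learned network would produce these scores in a real model.
    """
    n_freq, n_time = len(spec), len(spec[0])
    # One score per frequency band (row) and one per time frame (column).
    freq_scores = [sum(row) / n_time for row in spec]
    time_scores = [sum(spec[f][t] for f in range(n_freq)) / n_freq
                   for t in range(n_time)]
    freq_w = softmax(freq_scores)
    time_w = softmax(time_scores)
    # Scale each time-frequency bin by its band weight and its frame weight.
    weighted = [[spec[f][t] * freq_w[f] * time_w[t] for t in range(n_time)]
                for f in range(n_freq)]
    return weighted, freq_w, time_w


# Toy spectrogram: 3 frequency bands x 3 time frames (made-up values).
spec = [
    [0.1, 0.2, 0.1],
    [0.9, 2.0, 1.1],
    [0.2, 0.3, 0.2],
]
weighted, freq_w, time_w = time_frequency_attention(spec)
```

In this toy input the middle band and the middle frame carry the most energy, so they receive the largest attention weights; with learned scores, the weights would instead reflect whichever bands and frames are most informative for the emotion label.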