This work analyzes constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). A constant-Q filterbank provides a non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how this increased low-frequency resolution benefits SER. A time-domain comparative analysis of short-term mel-frequency spectral coefficients (MFSCs) and two constant-Q filterbank-based features, the constant-Q transform (CQT) and the continuous wavelet transform (CWT), reveals that constant-Q representations provide higher time-invariance at low frequencies. This provides increased robustness against emotion-irrelevant temporal variations in pitch, especially for low-arousal emotions. The corresponding frequency-domain analysis over different emotion classes shows that constant-Q-based time-frequency representations resolve pitch harmonics better than MFSCs. These advantages are further supported by SER performance in an extensive evaluation of the features over four publicly available databases, with six advanced deep neural network architectures as back-end classifiers. Our findings point toward the suitability and potential of constant-Q features for SER.
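The claim that a constant-Q filterbank has finer frequency resolution at low frequencies can be illustrated with a minimal sketch. It assumes the standard geometric spacing of center frequencies; the function name `constant_q_bands` and the parameter values (a 32.7 Hz lowest bin, 12 bins per octave, 7 octaves) are hypothetical choices for illustration, not settings taken from this work.

```python
import numpy as np

def constant_q_bands(f_min=32.7, bins_per_octave=12, n_octaves=7):
    """Center frequencies and bandwidths (Hz) of an idealized constant-Q filterbank.

    All default values are illustrative assumptions, not the paper's settings.
    """
    k = np.arange(bins_per_octave * n_octaves)
    # Geometrically spaced centers: each octave holds the same number of bins.
    centers = f_min * 2.0 ** (k / bins_per_octave)
    # Q is the ratio of center frequency to bandwidth, identical for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bandwidths = centers / q
    return centers, bandwidths, q

centers, bandwidths, q = constant_q_bands()
print(f"Q = {q:.2f}")
print(f"lowest bin:  center {centers[0]:.1f} Hz, bandwidth {bandwidths[0]:.2f} Hz")
print(f"highest bin: center {centers[-1]:.1f} Hz, bandwidth {bandwidths[-1]:.2f} Hz")
```

Because the bandwidth grows in proportion to the center frequency, the low-frequency bins are the narrowest, which is where the lower pitch harmonics of speech fall.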