In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). CQT-based time-frequency analysis provides variable spectro-temporal resolution, with higher frequency resolution at lower frequencies. Since the lower-frequency regions of the speech signal contain more emotion-related information than the higher-frequency regions, the increased low-frequency resolution of the CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on the STFT and the CQT for SER, with a deep neural network (DNN) as the back-end classifier. We optimize the parameters of both feature types. The CQT-based features outperform the STFT-based spectral features in the SER experiments. Further cross-corpora experiments demonstrate that the CQT-based systems generalize better when trained on out-of-domain data.
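The resolution argument above can be illustrated with a short numerical sketch. It uses the standard definitions of the two transforms, not the configuration used in this work: the sampling rate, FFT size, minimum frequency, and bins per octave are hypothetical values chosen only for illustration, and the function names are our own. The sketch compares the uniform bin spacing of the STFT with the frequency-dependent bandwidth of geometrically spaced CQT bins.

```python
import numpy as np


def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced centre frequencies: f_k = f_min * 2^(k / B)."""
    k = np.arange(n_bins)
    return f_min * 2.0 ** (k / bins_per_octave)


def cqt_bandwidths(freqs, bins_per_octave):
    """Bandwidth of each CQT bin, delta_f_k = f_k / Q, with constant Q."""
    q_factor = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return freqs / q_factor


def stft_bin_spacing(sample_rate, n_fft):
    """The STFT uses the same bin spacing at every frequency."""
    return sample_rate / n_fft


# Hypothetical analysis settings (assumptions, not taken from the paper).
SAMPLE_RATE = 16000      # Hz
N_FFT = 512              # STFT size
F_MIN = 62.5             # lowest CQT centre frequency in Hz
BINS_PER_OCTAVE = 24
N_BINS = 7 * BINS_PER_OCTAVE   # seven octaves, staying below Nyquist

freqs = cqt_center_frequencies(F_MIN, BINS_PER_OCTAVE, N_BINS)
bandwidths = cqt_bandwidths(freqs, BINS_PER_OCTAVE)
spacing = stft_bin_spacing(SAMPLE_RATE, N_FFT)   # 31.25 Hz

print(f"STFT bin spacing: {spacing:.2f} Hz at all frequencies")
for idx in (0, N_BINS // 2, N_BINS - 1):
    print(f"CQT bin {idx:3d}: centre {freqs[idx]:8.1f} Hz, "
          f"bandwidth {bandwidths[idx]:7.2f} Hz")
```

Under these assumed settings, the lowest CQT bins are much narrower than an STFT bin, while the highest bins are much wider. This is the trade-off the abstract refers to: finer frequency resolution at low frequencies in exchange for coarser resolution at high frequencies.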