Conventional feature-based classification methods do not transfer well to the automatic recognition of emotions in speech, mostly because the precise set of spectral and prosodic features required to identify a speaker's emotional state has not yet been determined. This paper presents a method that operates directly on the speech signal, thereby avoiding the problematic feature-extraction step. The method combines the strengths of the classical source-filter model of human speech production with those of the recently introduced liquid state machine (LSM), a biologically inspired spiking neural network (SNN). The source and vocal-tract components of the speech signal are first separated and converted into perceptually relevant spectral representations. These representations are then processed by two separate reservoirs of neurons. The output of each reservoir is reduced in dimensionality and fed to a final classifier. The method is shown to provide very good classification performance on the Berlin Database of Emotional Speech (Emo-DB), and it appears to be a promising framework for efficiently solving many other problems in speech processing.
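For illustration only, the sketch below assembles one hedged stand-in for each stage of the pipeline described above; it is not the authors' implementation. LPC analysis stands in for the source-filter separation, log-magnitude band spectra for the perceptually relevant representations, leaky-integrator rate reservoirs (echo-state style) for the spiking LSM, and PCA plus logistic regression for the unspecified dimensionality reduction and final classifier. All constants, function names, and the synthetic data are assumptions made for the sketch.

```python
"""Illustrative sketch of the two-reservoir pipeline (not the paper's code)."""
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter, freqz
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

FRAME, HOP = 400, 160       # 25 ms frames, 10 ms hop at 16 kHz (assumed)
LPC_ORDER = 12              # assumed vocal-tract model order
N_BANDS = 32                # crude placeholder for a perceptual frequency scale
N_RESERVOIR = 100           # rate neurons standing in for the spiking LSM

def frames(signal):
    """Slice a 1-D signal into overlapping Hamming-windowed frames."""
    win = np.hamming(FRAME)
    return [signal[i:i + FRAME] * win
            for i in range(0, len(signal) - FRAME, HOP)]

def source_filter_split(frame):
    """Split one frame into vocal-tract filter A(z) and source residual via LPC."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:]
    r[0] += 1e-6                                         # regularize
    a = solve_toeplitz(r[:LPC_ORDER], -r[1:LPC_ORDER + 1])
    A = np.concatenate(([1.0], a))                       # inverse filter A(z)
    return A, lfilter(A, [1.0], frame)                   # residual = source estimate

def band_spectrum(x):
    """Log-magnitude band spectrum (stand-in for a perceptual representation)."""
    mag = np.abs(np.fft.rfft(x, 512))
    return np.log(np.array([b.mean() for b in np.array_split(mag, N_BANDS)]) + 1e-8)

def make_reservoir(n_in, n_res=N_RESERVOIR, spectral_radius=0.9):
    """Random recurrent reservoir; a rate-based simplification of the LSM."""
    W_in = rng.normal(scale=0.5, size=(n_res, n_in))
    W = rng.normal(size=(n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(seq, W_in, W, leak=0.3):
    """Drive the reservoir with a feature sequence; return the mean state."""
    state, acc = np.zeros(W.shape[0]), np.zeros(W.shape[0])
    for u in seq:
        state = (1 - leak) * state + leak * np.tanh(W_in @ u + W @ state)
        acc += state
    return acc / max(len(seq), 1)

def utterance_features(signal, res_src, res_tract):
    """Full pipeline for one utterance: split, represent, run both reservoirs."""
    src_seq, tract_seq = [], []
    for f in frames(signal):
        A, residual = source_filter_split(f)
        _, h = freqz([1.0], A, worN=N_BANDS)             # spectral envelope of 1/A(z)
        tract_seq.append(np.log(np.abs(h) + 1e-8))
        src_seq.append(band_spectrum(residual))
    return np.concatenate((run_reservoir(src_seq, *res_src),
                           run_reservoir(tract_seq, *res_tract)))

if __name__ == "__main__":
    # Synthetic stand-ins for labelled utterances (Emo-DB audio is not bundled here).
    signals = [rng.normal(size=16000) for _ in range(40)]
    labels = rng.integers(0, 2, size=40)                 # placeholder emotion labels

    res_src, res_tract = make_reservoir(N_BANDS), make_reservoir(N_BANDS)
    X = np.array([utterance_features(s, res_src, res_tract) for s in signals])

    # Dimensionality reduction + final classifier (one hedged choice among many).
    clf = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
    clf.fit(X, labels)
    print("training accuracy:", clf.score(X, labels))
```

On real data, the random signals and labels would be replaced by Emo-DB utterances and their emotion categories, and the rate reservoirs by a genuine spiking LSM; the sketch only traces the data flow from source-filter separation through the two reservoirs to the classifier.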