Speech transcription, emotion recognition, and language identification are usually treated as three separate tasks, each requiring its own model with a distinct architecture and training process. We propose using a recurrent neural network transducer (RNN-T)-based speech-to-text (STT) system as a common component that serves emotion recognition and language identification in addition to speech recognition. Our work extends the STT system to emotion classification with minimal changes and reports successful results on the IEMOCAP and MELD datasets. We also demonstrate that adding a lightweight component to the RNN-T module enables language identification. In our evaluations, this new classifier achieves state-of-the-art accuracy on the NIST-LRE-07 dataset.