Paralinguistic speech processing is important for addressing many problems, such as sentiment analysis and neurocognitive disorder analysis. Recently, the Transformer has achieved remarkable success in natural language processing and has demonstrated its adaptability to speech. However, previous applications of the Transformer to speech have not incorporated the properties of the speech signal, leaving its full potential unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to efficiently model the intra- and inter-unit information (i.e., frames, phones, and words). According to the hierarchical relationship, we utilize merging blocks to generate features at different granularities, which is consistent with the structural pattern in the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which effectively balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ), and Alzheimer's disease detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to state-of-the-art approaches.
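To make the hierarchical design concrete, below is a minimal PyTorch sketch of the two core ideas described above: a unit encoder whose attention is restricted to a local window (approximating intra-/inter-unit modeling), and a merging block that pools fine-grained features into coarser units (frames to phones, phones to words). This is not the authors' implementation; the module names, merge scales, and window sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnitEncoder(nn.Module):
    """Transformer encoder layer with attention restricted to a local window,
    a rough stand-in for intra-/inter-unit modeling (window size assumed)."""
    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window
        self.layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (batch, seq_len, dim)
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        # Boolean mask: True = position pair may NOT attend (outside the window).
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        return self.layer(x, src_mask=mask)

class MergingBlock(nn.Module):
    """Merges fine-grained features into coarser units by average-pooling over
    a fixed merge scale, mirroring the speech hierarchy (scale assumed)."""
    def __init__(self, dim, merge_scale):
        super().__init__()
        self.merge_scale = merge_scale
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, seq_len, dim)
        b, t, d = x.shape
        pad = (-t) % self.merge_scale              # pad so seq_len divides evenly
        if pad:
            x = torch.cat([x, x.new_zeros(b, pad, d)], dim=1)
        x = x.view(b, -1, self.merge_scale, d).mean(dim=2)
        return self.proj(x)

# Hypothetical pipeline: frame features -> phone-grained -> word-grained.
frames = torch.randn(2, 300, 128)                  # e.g. 300 frames, dim 128
x = UnitEncoder(128, 4, window=8)(frames)          # frame-stage unit encoder
x = MergingBlock(128, merge_scale=5)(x)            # frames -> ~phone units
x = UnitEncoder(128, 4, window=4)(x)               # phone-stage unit encoder
x = MergingBlock(128, merge_scale=4)(x)            # phones -> ~word units
```

The local attention mask is what reduces computation relative to a standard Transformer, since each position attends only to a bounded neighborhood within its unit rather than to the whole sequence.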