Fueled by recent advances in self-supervised models, pre-trained speech representations have proved effective for the downstream speech emotion recognition (SER) task. However, most prior work focuses on exploiting the pre-trained representations and merely adopts a linear head on top of the pre-trained model, neglecting the design of the downstream network. In this paper, we propose a temporal shift module that mingles channel-wise information along the time axis without introducing any parameters or FLOPs. With the temporal shift module, three designed baseline building blocks evolve into their corresponding shift variants, i.e., ShiftCNN, ShiftLSTM, and Shiftformer. Moreover, to balance the trade-off between mingling and misalignment, we propose two technical strategies: placement of shift and proportion of shift. The family of temporal shift models consistently outperforms state-of-the-art methods on the benchmark IEMOCAP dataset under both fine-tuning and feature-extraction settings. Our code is available at https://github.com/ECNU-Cross-Innovation-Lab/ShiftSER.
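For intuition, below is a minimal PyTorch sketch of a parameter-free temporal shift over a (batch, time, channel) tensor: a fraction of channels is shifted one frame forward in time, an equal fraction one frame backward, and the rest are left in place. The tensor layout and the `shift_ratio` argument are illustrative assumptions; the exact placement and proportion of shift used in the paper are specified in the linked repository.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Parameter-free temporal shift on a (batch, time, channel) tensor.

    A fraction of channels moves one step forward in time, an equal
    fraction moves one step backward, and the remainder stays put.
    Vacated frames are zero-padded, so the operation adds no learnable
    parameters and no FLOPs (it is pure memory movement).
    """
    b, t, c = x.shape
    n = int(c * shift_ratio)  # channels shifted in each direction (assumed ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]             # first n channels: shift forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]   # next n channels: shift backward in time
    out[:, :, 2 * n:] = x[:, :, 2 * n:]        # remaining channels: unchanged
    return out
```

Shifting only a proportion of the channels is what mediates the trade-off the abstract mentions: a larger `shift_ratio` mingles more temporal context across channels, but also misaligns more of the representation with its original frame.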