Despite the widespread use of deep neural networks (DNNs) for speech emotion recognition (SER), their performance is severely restricted by the paucity of labeled training data. Recently, segment-based approaches to SER have been evolving; these train backbone networks on shorter segments instead of whole utterances and thus naturally augment the training examples without additional resources. However, one core challenge remains for segment-based approaches: most emotional corpora do not provide ground-truth labels at the segment level. To train a segment-based emotion model on such datasets in a supervised manner, the most common approach is to assign each segment the emotion label of its utterance. This practice typically introduces noisy (incorrect) labels, because emotional information is not uniformly distributed across the whole utterance. Moreover, DNNs have been shown to easily over-fit a dataset when trained with noisy labels. To this end, this work proposes a simple and effective deep self-learning (DSL) framework, which comprises a procedure to progressively correct segment-level labels in an iterative learning manner. The DSL method produces dynamically generated, soft emotion labels, leading to significant performance improvements. Experiments on three well-known emotional corpora demonstrate noticeable gains using the proposed method.
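The two ideas summarized above, segments inheriting the utterance label and those labels then being corrected iteratively into soft labels, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the proposed framework: a linear softmax classifier stands in for the backbone network, and the function names (`segment_utterance`, `inherit_labels`, `self_learn`), the blending weight `alpha`, and the update rule that mixes the inherited label with the model's prediction are all hypothetical.

```python
import numpy as np

def segment_utterance(features, seg_len, hop):
    """Split a (frames, dims) feature matrix into fixed-length segments.
    An utterance shorter than seg_len yields one shorter segment."""
    starts = range(0, max(len(features) - seg_len, 0) + 1, hop)
    return [features[s:s + seg_len] for s in starts]

def inherit_labels(num_segments, utt_label, num_classes):
    """Common baseline: every segment gets its utterance's label (one-hot)."""
    labels = np.zeros((num_segments, num_classes))
    labels[:, utt_label] = 1.0
    return labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(x, soft_targets, epochs=200, lr=0.5):
    """Fit a linear softmax classifier on soft targets with cross-entropy.
    Stand-in for the backbone network (assumption)."""
    w = np.zeros((x.shape[1], soft_targets.shape[1]))
    for _ in range(epochs):
        probs = softmax(x @ w)
        w -= lr * x.T @ (probs - soft_targets) / len(x)
    return w

def self_learn(x, init_labels, rounds=3, alpha=0.5):
    """Iteratively replace segment labels with soft labels.
    Assumed update rule: blend inherited labels with model predictions."""
    labels = init_labels.copy()
    for _ in range(rounds):
        w = train_softmax(x, labels)        # train on current labels
        preds = softmax(x @ w)              # predict on the same segments
        labels = alpha * init_labels + (1 - alpha) * preds
    return labels

# Usage: 10 frames of 2-dim features, 4-frame segments with a hop of 2.
feats = np.arange(20, dtype=float).reshape(10, 2)
segments = segment_utterance(feats, seg_len=4, hop=2)
hard = inherit_labels(len(segments), utt_label=1, num_classes=3)
```

In this sketch each row of the returned label matrix remains a probability distribution, so it can be used directly as a soft target in the next training round. Keeping a fixed share of the inherited label (`alpha`) is one simple way to prevent the labels from drifting entirely toward the model's own predictions; other correction rules are equally plausible.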