Emotion recognition has been shown to be effective as a key component of high-stakes downstream applications such as classroom engagement or mental health assessment. These systems are generally trained on small datasets collected in a single laboratory environment, and hence falter when tested on data with different noise characteristics. Multiple noise-based data augmentation approaches have been proposed to counteract this challenge in other speech domains. However, unlike in speech recognition and speaker verification, noise-based data augmentation in emotion recognition may change the underlying label of the original emotional sample. In this work, we generate realistic noisy samples of a well-known emotion dataset (IEMOCAP) using multiple categories of environmental and synthetic noise. We evaluate how both human and machine emotion perception changes when noise is introduced. We find that some commonly used augmentation techniques for emotion recognition significantly change human perception, which may make evaluation metrics unreliable, for example when measuring the efficacy of adversarial attacks. We also find that state-of-the-art emotion recognition models fail to classify unseen noise-augmented samples, even when trained on noise-augmented datasets. This finding demonstrates the brittleness of these systems in real-world conditions. We propose a set of recommendations for noise-based augmentation of emotion datasets and for deploying these emotion recognition systems "in the wild".
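To make the idea of noise-based augmentation concrete, the following is a minimal sketch of one common way to add noise to a clean sample: scaling the noise so that the mixture reaches a chosen signal-to-noise ratio (SNR). The abstract does not state how the noisy IEMOCAP samples were produced, so the SNR-based mixing, the function names `mix_at_snr` and `measured_snr_db`, and the toy tone-plus-white-noise signals are all assumptions made for illustration, not the procedure used in this work.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not the procedure from the paper.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    if clean_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # Scale the noise so that clean_power / scaled_noise_power == 10^(snr_db / 10).
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [x + gain * n for x, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of the mixture, treating (noisy - clean) as the noise component."""
    signal_power = sum(x * x for x in clean) / len(clean)
    residual_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / residual_power)


# Toy stand-ins for audio (assumed): a 220 Hz tone and white Gaussian noise,
# one second each at a 16 kHz sample rate.
rng = random.Random(0)
sample_rate = 16000
clean = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

A mixture built this way preserves the clean waveform and only controls the noise level. It says nothing about whether a listener would still assign the same emotion label to the noisy sample, which is the question the human evaluation in this work addresses.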