Deep learning models trained on audio-visual data have achieved state-of-the-art performance in emotion recognition. In particular, models trained with multitask learning have shown additional performance improvements. However, such multitask models entangle information between the tasks, encoding the mutual dependencies present in the label distributions of the real-world data used for training. This work explores the disentanglement of multimodal signal representations for the primary task of emotion recognition and a secondary person-identification task. In particular, we develop a multitask framework to extract low-dimensional embeddings that aim to capture emotion-specific information while containing minimal information related to person identity. We evaluate three different disentanglement techniques and report up to 13% disentanglement while maintaining emotion recognition performance.
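To make the multitask setup concrete, the sketch below shows one common disentanglement technique: a shared encoder trained jointly with an emotion head and an adversarial identity head whose gradient is reversed before reaching the encoder. This is a minimal NumPy illustration under assumed toy dimensions and synthetic data; it is not necessarily one of the three techniques evaluated in this work, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): N samples, D-dim features,
# E emotion classes, S speakers, H-dim shared embedding.
N, D, E, S, H = 200, 20, 4, 10, 8
X = rng.normal(size=(N, D))
y_emo = X[:, :E].argmax(axis=1)          # synthetic emotion labels with real signal
y_id = rng.integers(0, S, size=N)        # synthetic speaker labels

# Parameters: shared linear encoder and two linear task heads.
W_enc = rng.normal(scale=0.1, size=(D, H))
W_emo = rng.normal(scale=0.1, size=(H, E))
W_id = rng.normal(scale=0.1, size=(H, S))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lam, lr = 0.5, 0.1                       # reversal strength, learning rate
for step in range(200):
    Z = X @ W_enc                        # shared embedding
    P_emo = softmax(Z @ W_emo)
    P_id = softmax(Z @ W_id)

    # Cross-entropy gradients at each head's logits.
    G_emo = P_emo.copy(); G_emo[np.arange(N), y_emo] -= 1; G_emo /= N
    G_id = P_id.copy();   G_id[np.arange(N), y_id] -= 1;   G_id /= N

    # Both heads descend their own losses.
    W_emo -= lr * (Z.T @ G_emo)
    W_id  -= lr * (Z.T @ G_id)

    # Gradient reversal: the encoder descends the emotion loss but
    # ASCENDS the identity loss (scaled by lam), so the embedding keeps
    # emotion information while shedding speaker information.
    grad_enc = X.T @ (G_emo @ W_emo.T) - lam * (X.T @ (G_id @ W_id.T))
    W_enc -= lr * grad_enc

# Training accuracy of the emotion head on the toy data.
emo_acc = (softmax(X @ W_enc @ W_emo).argmax(axis=1) == y_emo).mean()
```

In this scheme, raising `lam` trades identity leakage against emotion accuracy; sweeping it is one way to quantify the disentanglement-versus-performance trade-off the abstract reports.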