Recently, wearable emotion recognition based on peripheral physiological signals has drawn massive attention due to its less invasive nature and its applicability in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully supervised approaches suffer from overfitting given limited labeled data. To address the above issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically labeled via five signal transformations, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset, and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results on various emotion classification tasks. Moreover, the proposed model proved to be more accurate and robust than fully supervised methods in low-data regimes.
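The sketch below is a minimal illustration (not the authors' implementation) of the fusion architecture and pretext task described above: per-modality temporal-convolution encoders feed a shared transformer encoder, whose pooled output is classified by a signal-transformation-recognition head. Modality names, layer sizes, window lengths, and the number of output classes are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch; all hyperparameters and modality names are illustrative.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Temporal-convolution encoder for one physiological signal (intra-modal features)."""

    def __init__(self, in_channels: int, d_model: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, padding=3),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, time, d_model)
        return self.net(x).transpose(1, 2)


class SSLTransformRecognizer(nn.Module):
    """Modality-specific encoders + shared transformer encoder + pretext head."""

    def __init__(self, modality_channels: dict, d_model: int = 64, n_classes: int = 6):
        super().__init__()
        # n_classes is an assumption: e.g. five transformations plus the original signal.
        self.encoders = nn.ModuleDict(
            {m: ModalityEncoder(c, d_model) for m, c in modality_channels.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Concatenate modality token sequences along time, fuse with self-attention
        # (inter-modal correlations), then mean-pool and classify the applied transform.
        tokens = torch.cat([self.encoders[m](x) for m, x in inputs.items()], dim=1)
        fused = self.shared(tokens)
        return self.head(fused.mean(dim=1))


# Usage with dummy windows for two assumed modalities (e.g., ECG and EDA):
model = SSLTransformRecognizer({"ecg": 1, "eda": 1})
batch = {"ecg": torch.randn(8, 1, 256), "eda": torch.randn(8, 1, 256)}
logits = model(batch)  # (8, n_classes) logits over candidate signal transformations
```

For the downstream stage, the pretext head would be replaced by an emotion classifier while the encoders are either frozen or fine-tuned, in line with the evaluation protocol described above.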