Electroencephalogram (EEG)-based emotion recognition is vital for affective computing but faces challenges in feature utilization and cross-domain generalization. This work introduces EmotionCLIP, which reformulates emotion recognition as an EEG-text matching task within the CLIP framework. A tailored backbone, SST-LegoViT, captures spatial, spectral, and temporal features using multi-scale convolution and Transformer modules. Experiments on the SEED and SEED-IV datasets show superior cross-subject accuracies of 88.69% and 73.50%, and cross-time accuracies of 88.46% and 77.54%, respectively, outperforming existing models. The results demonstrate the effectiveness of multimodal contrastive learning for robust EEG emotion recognition.
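To make the EEG-text matching formulation concrete, the sketch below shows a CLIP-style symmetric contrastive objective between a batch of EEG embeddings and matched text-prompt embeddings. This is a minimal illustration under assumed PyTorch conventions; the encoder outputs are stand-ins for SST-LegoViT and a text encoder, not the authors' released implementation.

```python
# Minimal sketch of a CLIP-style EEG-text contrastive objective (assumption:
# PyTorch; the embeddings stand in for SST-LegoViT and text-encoder outputs).
import torch
import torch.nn.functional as F


def clip_style_loss(eeg_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched EEG-text pairs."""
    # L2-normalize both modalities so the dot product is cosine similarity.
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by temperature: shape (batch, batch).
    logits = eeg_emb @ text_emb.t() / temperature

    # The i-th EEG segment matches the i-th text prompt (diagonal targets).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over EEG->text and text->EEG directions.
    loss_e2t = F.cross_entropy(logits, targets)
    loss_t2e = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_e2t + loss_t2e)


if __name__ == "__main__":
    # Random tensors standing in for encoder outputs (hypothetical sizes).
    batch, dim = 8, 128
    eeg_emb = torch.randn(batch, dim)   # e.g. SST-LegoViT features
    text_emb = torch.randn(batch, dim)  # e.g. emotion-prompt embeddings
    print(clip_style_loss(eeg_emb, text_emb).item())
```

At inference, recognition reduces to scoring an EEG embedding against the text embeddings of the candidate emotion prompts and taking the highest-similarity class.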