With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal content. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme content and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on the raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3\% on MET-MEME and 3.4\% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.
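To make the two components concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The first snippet illustrates the four-step textual enhancement as a sequential MLLM prompting loop; the step wording and the `query_mllm` helper are hypothetical placeholders, since the abstract does not specify the prompts or the MLLM interface.

```python
# Hypothetical four-step enhancement pipeline; the step prompts and the
# query_mllm callable are illustrative assumptions, not the paper's exact design.
STEPS = [
    "Describe the visual content of this meme.",
    "Explain any cultural or background knowledge the meme relies on.",
    "Infer the implicit meaning conveyed by the image-text combination.",
    "Summarize contextual cues relevant to the meme's emotional intent.",
]

def enhance_meme_text(image, ocr_text, query_mllm):
    """Run the four steps sequentially, feeding each answer back as context."""
    context = f"Meme text: {ocr_text}"
    answers = []
    for step in STEPS:
        answer = query_mllm(image=image, prompt=f"{context}\n{step}")
        answers.append(answer)
        context += f"\n{answer}"
    # The concatenated answers serve as the enhanced text for classification.
    return "\n".join(answers)
```

The second snippet sketches the dual-stage fusion as a small PyTorch module, assuming all features are pre-extracted as (batch, sequence, dim) tensors; the layer types, depths, and dimensions are illustrative choices, not those reported in the paper.

```python
import torch
import torch.nn as nn

class DualStageFusion(nn.Module):
    """Illustrative sketch of shallow-then-deep cross-modal fusion."""

    def __init__(self, dim=768, heads=8, deep_layers=4, num_classes=7):
        super().__init__()
        # Stage 1: a single cross-attention layer for shallow fusion
        # of the raw image and text features.
        self.shallow_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 2: a deeper transformer stack that integrates the
        # MLLM-enhanced visual and textual features.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.deep_encoder = nn.TransformerEncoder(layer, num_layers=deep_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, img_feats, txt_feats, enh_img_feats, enh_txt_feats):
        # Shallow fusion: raw text tokens attend to raw image patches.
        shallow, _ = self.shallow_attn(txt_feats, img_feats, img_feats)
        # Deep fusion: concatenate the shallow output with the enhanced
        # features and process them with the deeper encoder.
        fused = torch.cat([shallow, enh_img_feats, enh_txt_feats], dim=1)
        deep = self.deep_encoder(fused)
        # Mean-pool the fused sequence and predict the emotion class.
        return self.classifier(deep.mean(dim=1))
```

The two-stage split mirrors the hierarchy described above: cheap early interaction over raw inputs, followed by heavier integration once the MLLM-derived context is available.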