The ease of sharing multimedia content on social media has accelerated the dissemination of fake news, which threatens social stability and security. Fake news detection has therefore garnered extensive research interest in the field of social forensics. Current methods primarily concentrate on integrating textual and visual features but fail to effectively exploit multi-modal information at both fine-grained and coarse-grained levels. Furthermore, they suffer from an ambiguity problem arising from weak correlation between modalities or contradictory decisions made by each modality. To overcome these challenges, we present a Multi-grained Multi-modal Fusion Network (MMFN) for fake news detection. Inspired by the multi-grained process by which humans assess news authenticity, we employ two Transformer-based pre-trained models to encode token-level features from text and images, respectively. A multi-modal fusion module combines these fine-grained features while taking into account coarse-grained features encoded by the CLIP encoder. To address the ambiguity problem, we design uni-modal branches with similarity-based weighting that adaptively adjusts the use of multi-modal features. Experimental results demonstrate that the proposed framework outperforms state-of-the-art methods on three prevalent datasets.
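To make the similarity-based weighting concrete, the sketch below illustrates one plausible reading of the idea: fine-grained text and image tokens are fused by cross-attention, and the cosine similarity between coarse-grained CLIP text and image embeddings gates how much the multi-modal branch contributes relative to the uni-modal branches. This is a minimal illustration, not the authors' implementation; all dimensions, module choices, and the gating formula are assumptions.

```python
# Minimal sketch (assumed design, not the MMFN reference implementation)
# of similarity-weighted multi-modal fusion for fake news classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityWeightedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Cross-attention fuses fine-grained token-level text/image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_head = nn.Linear(dim, 2)   # uni-modal text branch
        self.image_head = nn.Linear(dim, 2)  # uni-modal image branch
        self.fused_head = nn.Linear(dim, 2)  # multi-modal branch

    def forward(self, text_tok, img_tok, clip_txt, clip_img):
        # text_tok: (B, Lt, dim) token-level text features (e.g. from BERT)
        # img_tok:  (B, Li, dim) patch-level image features (e.g. from a ViT)
        # clip_txt, clip_img: (B, d_clip) coarse-grained CLIP embeddings
        fused, _ = self.cross_attn(text_tok, img_tok, img_tok)
        fused = fused.mean(dim=1)  # pool fused tokens into one vector

        # CLIP cosine similarity gauges cross-modal consistency; map it to
        # [0, 1] and use it to gate the multi-modal branch (an assumed
        # instantiation of the paper's similarity-based weighting).
        w = (F.cosine_similarity(clip_txt, clip_img, dim=-1) + 1) / 2  # (B,)
        w = w.unsqueeze(-1)

        uni = 0.5 * (self.text_head(text_tok.mean(dim=1))
                     + self.image_head(img_tok.mean(dim=1)))
        return w * self.fused_head(fused) + (1 - w) * uni  # (B, 2) logits
```

Under this reading, a mismatched text-image pair (low CLIP similarity) suppresses the fused branch and falls back on the uni-modal decisions, which is one way to resolve the ambiguity problem the abstract describes.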