Recently, fake news containing both text and images has diffused more effectively than text-only fake news, raising the severe issue of multimodal fake news detection. Current studies on this issue have made significant contributions to developing multimodal models, but they fall short of modeling the multimodal content sufficiently. Most of them only preliminarily model the basic semantics of images as a supplement to the text, which limits their detection performance. In this paper, we identify three valuable text-image correlations in multimodal fake news: entity inconsistency, mutual enhancement, and text complementation. To effectively capture these multimodal clues, we innovatively extract visual entities (such as celebrities and landmarks) to understand the news-related high-level semantics of images, and then model the multimodal entity inconsistency and mutual enhancement with the help of visual entities. Moreover, we extract the text embedded in images as a complement to the original text. Integrating these insights, we propose a novel entity-enhanced multimodal fusion framework that simultaneously models the three cross-modal correlations to detect diverse multimodal fake news. Extensive experiments demonstrate the superiority of our model over the state of the art.