Multimodal Misinformation Detection (MMD) is the task of detecting misinformation in social media posts, which typically contain both text and image modalities. By examining MMD posts, we observe that the text modality tends to be far more informative than the image modality: the text generally describes the whole event or story of a post, whereas the image often presents only partial scenes. Our preliminary empirical results confirm that the image modality indeed contributes less to MMD. Based on this observation, we propose a new MMD method named RETSIMD. Specifically, we assume that each text can be divided into several segments, each describing a partial scene that can be depicted by an image. Accordingly, we split the text into a sequence of segments and feed these segments into a pre-trained text-to-image generator to augment the post with a sequence of images. To adapt the generator, we incorporate two auxiliary objectives concerning text-image and image-label mutual information, and further post-train the generator on an auxiliary text-to-image generation benchmark dataset. Additionally, we construct a graph by defining three heuristic relationships between images, and use a graph neural network to produce the fused features. Extensive empirical results validate the effectiveness of RETSIMD.
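The pipeline described above (segment the text, generate one image per segment, connect the images in a graph, and fuse them with a graph neural network) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `fake_embed` stands in for the actual text-to-image generator plus image encoder, and the three edge-building rules in `build_graph` are hypothetical placeholders, since the abstract does not specify the paper's heuristic relationships.

```python
import re

def segment_text(text):
    """Split a post's text into sentence-level segments, one per scene."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def fake_embed(segment, dim=4):
    """Stand-in for text-to-image generation followed by image encoding;
    here we just fold character codes into a toy feature vector."""
    v = [0.0] * dim
    for i, ch in enumerate(segment):
        v[i % dim] += ord(ch) / 1000.0
    return v

def build_graph(n):
    """Adjacency over the n generated images using three illustrative
    relations (NOT the paper's): self-loops, sequential neighbors,
    and hub edges to the first image."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1                          # self-loop
        if i + 1 < n:
            adj[i][i + 1] = adj[i + 1][i] = 1  # sequential neighbor
        if i != 0:
            adj[0][i] = adj[i][0] = 1          # hub edge to first image
    return adj

def gnn_layer(feats, adj):
    """One mean-aggregation message-passing step (GCN-style sketch)."""
    n, dim = len(feats), len(feats[0])
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        out.append([sum(feats[j][d] for j in nbrs) / len(nbrs)
                    for d in range(dim)])
    return out

post = "A flood hit the city. Rescue teams arrived. Residents were evacuated."
segments = segment_text(post)
features = [fake_embed(s) for s in segments]
fused = gnn_layer(features, build_graph(len(segments)))
print(len(segments), len(fused))  # prints "3 3": three segments, three fused vectors
```

In a real system, the fused per-image features would be pooled and combined with the text representation before the final misinformation classifier; the sketch stops at the fusion step the abstract describes.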