Recent years have witnessed the sustained evolution of misinformation aimed at manipulating public opinion. Unlike traditional rumor or fake-news editors, who mainly rely on generated and/or counterfeited images, text, and videos, current misinformation creators increasingly use out-of-context multimedia content (e.g., mismatched images and captions) to deceive both the public and fake-news detection systems. This new type of misinformation makes not only detection but also clarification more difficult, because each individual modality is close enough to true information. To address this challenge, we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies mismatched image-caption pairs and the cross-modal contradictions between them, which helps fact-checking websites document clarifications. The proposed model first symbolically disassembles the text-modality information into a set of fact queries based on the Abstract Meaning Representation (AMR) of the caption, and then forwards the query-image pairs into a pre-trained large vision-language model to select the ``evidence'' that helps detect misinformation. Extensive experiments indicate that the proposed methodology provides much more interpretable predictions while matching the accuracy of the state-of-the-art model on this task.
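To make the query-image evidence-selection step concrete, the following is a minimal sketch, not the paper's implementation: it assumes the caption has already been decomposed (e.g., by an AMR parser) into a handful of hand-written fact queries, and it uses an off-the-shelf CLIP checkpoint as a stand-in for the pre-trained vision-language model, surfacing the lowest-scoring queries as candidate contradictions.

```python
# Sketch of scoring caption-derived fact queries against an image with CLIP.
# The query list and model choice are illustrative assumptions, not the
# authors' exact pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_fact_queries(image_path: str, fact_queries: list[str]) -> list[tuple[str, float]]:
    """Return (query, image-text similarity) pairs, lowest-scoring first."""
    image = Image.open(image_path)
    inputs = processor(text=fact_queries, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_queries): one similarity per query.
    scores = outputs.logits_per_image.squeeze(0).tolist()
    return sorted(zip(fact_queries, scores), key=lambda pair: pair[1])

# Hypothetical fact queries decomposed from the caption
# "Protesters gather outside the parliament building in Paris."
queries = [
    "a crowd of protesters",
    "a parliament building",
    "the scene is in Paris",
]
# evidence = score_fact_queries("news_photo.jpg", queries)
# Queries with low similarity can be reported as interpretable evidence
# that the image and caption are mismatched.
```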