Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes. We are motivated by the threat scenario where an image is used out of context to support a certain narrative. While some prior datasets for detecting image-text inconsistency generate samples via text manipulation, we propose a dataset where both image and text are unmanipulated but mismatched. We introduce several strategies for automatically retrieving convincing images for a given caption, capturing cases with inconsistent entities or semantic context. Our large-scale, automatically generated NewsCLIPpings Dataset: (1) demonstrates that machine-driven image repurposing is now a realistic threat, and (2) provides samples that represent challenging instances of image-text mismatch in news that can mislead humans. We benchmark several state-of-the-art multimodal models on our dataset and analyze their performance across different pretraining domains and visual backbones.
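The core retrieval idea, automatically pairing a caption with a convincing but wrong image, can be sketched as nearest-neighbor search in a joint image-text embedding space (e.g., CLIP-style embeddings, as the dataset name suggests). The sketch below is illustrative only: the function name and the use of plain NumPy over precomputed, L2-normalized embeddings are assumptions, not the paper's actual pipeline.

```python
import numpy as np

def retrieve_mismatched_image(caption_emb, image_embs, true_idx):
    """Return the index of the most semantically convincing *wrong* image:
    the image whose embedding is closest to the caption's embedding,
    excluding the caption's true image. Assumes L2-normalized embeddings,
    so the dot product is cosine similarity."""
    sims = image_embs @ caption_emb          # cosine similarity to each image
    sims[true_idx] = -np.inf                 # never return the true match
    return int(np.argmax(sims))

# Toy example: 4 fake normalized image embeddings in a 2-D space.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 2))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

# A caption embedding that sits near image 2 (its "true" image).
caption_emb = image_embs[2] + 0.01 * rng.normal(size=2)
caption_emb /= np.linalg.norm(caption_emb)

idx = retrieve_mismatched_image(caption_emb, image_embs, true_idx=2)
# idx is the closest *other* image, a plausible out-of-context pairing.
```

In the real dataset, such similarity search would run over large news corpora, with variants that match on entities (e.g., same person, different event) or on overall scene semantics, producing the inconsistent-entity and inconsistent-context cases the abstract describes.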