In recent years, misinformation on the web has become widespread across languages, countries, and social media platforms. Although there has been much work on automated fake news detection, the role of images and their variety is not well explored. In this paper, we investigate the roles of images and text at an earlier stage of the fake news detection pipeline, called claim detection. For this purpose, we introduce a novel dataset, MM-Claims, which consists of tweets and corresponding images on three topics: COVID-19, Climate Change, and, broadly, Technology. The dataset contains roughly 86,000 tweets, of which 3,400 are manually labeled by multiple annotators for the training and evaluation of multimodal models. We describe the dataset in detail, evaluate strong unimodal and multimodal baselines, and analyze the potential and drawbacks of current models.