Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advances in generative AI, which can produce realistic, human-like text and images, have accelerated the spread and expanded the reach of health misinformation. To combat infodemics, most existing work has focused on building misinformation datasets from social media and fact-checking platforms, but these datasets are limited in topical coverage, inclusion of AI-generated content, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles that contain both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) produced by various state-of-the-art (SOTA) generative AI models. Additionally, we benchmark our dataset on three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human-generated and machine-generated content at the multimodal level.