From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy to adopt recommended health measures. Recent advances in generative AI, which can produce realistic, human-like text and images, have accelerated the spread and expanded the reach of health misinformation. To combat infodemics, most existing work has focused on building misinformation datasets from social media and fact-checking platforms, but these datasets are limited in topical coverage, inclusion of AI-generated content, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset on three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human-generated and machine-generated content at the multimodal level.
