Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen

Source: arXiv; 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, which limits their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLM training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly Vietnamese, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs' learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, especially for low-resource languages and clinical use in Vietnamese healthcare. The source code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.

