Multimodal Foundation Models for Early Disease Detection

Healthcare data now span EHRs, medical imaging, genomics, and wearable sensors, but most diagnostic models still process these modalities in isolation. This limits their ability to capture early, cross-modal disease signatures. This paper introduces a multimodal foundation model built on a transformer architecture that integrates heterogeneous clinical data through modality-specific encoders and cross-modal attention. Each modality is mapped into a shared latent space and fused using multi-head attention with residual normalization. We implement the framework using a multimodal dataset that simulates early-stage disease patterns across EHR sequences, imaging patches, genomic profiles, and wearable signals, including missing-modality scenarios and label noise. The model is trained using supervised classification together with self-supervised reconstruction and contrastive alignment to improve robustness. Experimental evaluation demonstrates strong performance in early-detection settings, with stable classification metrics, reliable uncertainty estimates, and interpretable attention patterns. The approach moves toward a flexible, pretrain-and-fine-tune foundation model that supports precision diagnostics, handles incomplete inputs, and improves early disease detection across oncology, cardiology, and neurology applications.
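
The fusion step described above (modality-specific encoders, a shared latent space, multi-head attention with a residual connection and normalization, and tolerance of missing modalities) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The layer sizes, the single-linear-layer encoders, the random untrained weights, the mean pooling, and all names (`make_encoder`, `fuse`, `input_dims`, and so on) are assumptions made for illustration only, and "residual normalization" is read here as a residual connection followed by layer normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_HEADS = 32, 4  # hypothetical latent width and head count


def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def make_encoder(in_dim):
    """Hypothetical modality-specific encoder: one linear map into the shared space."""
    W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, D_MODEL))
    return lambda x: x @ W


def multi_head_attention(tokens, mask, params):
    """tokens: (M, D_MODEL), one token per modality; mask: (M,) bool, True = present."""
    Wq, Wk, Wv, Wo = params
    M = tokens.shape[0]
    d_head = D_MODEL // N_HEADS

    def split(x):  # (M, D_MODEL) -> (N_HEADS, M, d_head)
        return x.reshape(M, N_HEADS, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ Wq), split(tokens @ Wk), split(tokens @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (N_HEADS, M, M)
    scores = np.where(mask[None, None, :], scores, -1e9)   # hide missing modalities as keys
    scores -= scores.max(axis=-1, keepdims=True)           # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = (weights @ v).transpose(1, 0, 2).reshape(M, D_MODEL)
    return out @ Wo, weights


def fuse(features, encoders, attn_params):
    """Encode each available modality, attend across modalities, then pool.

    Assumes at least one modality is present.
    """
    names = list(encoders)
    mask = np.array([features.get(n) is not None for n in names])
    tokens = np.stack([
        encoders[n](features[n]) if present else np.zeros(D_MODEL)
        for n, present in zip(names, mask)
    ])
    attended, weights = multi_head_attention(tokens, mask, attn_params)
    fused = layer_norm(tokens + attended)   # residual connection, then normalization
    pooled = fused[mask].mean(axis=0)       # average over the modalities that are present
    return pooled, weights


# Hypothetical per-modality feature sizes; real inputs would be sequences, patches, etc.
input_dims = {"ehr": 20, "imaging": 64, "genomics": 100, "wearable": 16}
encoders = {name: make_encoder(dim) for name, dim in input_dims.items()}
attn_params = [rng.normal(scale=D_MODEL ** -0.5, size=(D_MODEL, D_MODEL)) for _ in range(4)]

patient = {name: rng.normal(size=dim) for name, dim in input_dims.items()}
patient["genomics"] = None  # simulate a missing-modality scenario

pooled, weights = fuse(patient, encoders, attn_params)
print(pooled.shape, weights.shape)
```

In this sketch a missing modality is handled by masking its token out of the attention keys and excluding it from the pooled representation, so the remaining modalities still produce a fixed-size vector that a classification head could consume. The returned `weights` array holds the per-head cross-modal attention pattern, which is the kind of quantity an interpretability analysis would inspect.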
