Healthcare data now span EHRs, medical imaging, genomics, and wearable sensors, yet most diagnostic models still process these modalities in isolation, limiting their ability to capture early, cross-modal disease signatures. This paper introduces a multimodal foundation model built on a transformer architecture that integrates heterogeneous clinical data through modality-specific encoders and cross-modal attention. Each modality is mapped into a shared latent space and fused with multi-head attention using residual connections and layer normalization. We evaluate the framework on a multimodal dataset that simulates early-stage disease patterns across EHR sequences, imaging patches, genomic profiles, and wearable signals, including missing-modality scenarios and label noise. The model is trained with a supervised classification objective combined with self-supervised reconstruction and contrastive alignment to improve robustness. Experimental evaluation demonstrates strong performance in early-detection settings, with stable classification metrics, reliable uncertainty estimates, and interpretable attention patterns. The approach moves toward a flexible, pretrain-and-fine-tune foundation model that supports precision diagnostics, handles incomplete inputs, and improves early disease detection across oncology, cardiology, and neurology applications.
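To make the described pipeline concrete, the sketch below shows one way the components summarized above could fit together in PyTorch: modality-specific encoders projecting into a shared latent space, multi-head cross-modal attention with residual connections and layer normalization, and a multi-task loss combining supervised classification, self-supervised reconstruction, and contrastive alignment. All names, dimensions, pooling choices, and loss weights (`ModalityEncoder`, `CrossModalFusion`, `multitask_loss`, `alpha`/`beta`/`gamma`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the fusion-and-training idea described in the abstract.
# Module names, dimensions, and loss weights are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps one modality (EHR sequence, imaging patch, genomic profile,
    or wearable signal) into a shared latent space of dimension d_model."""
    def __init__(self, input_dim: int, d_model: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, d_model),
            nn.GELU(),
            nn.LayerNorm(d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)  # (batch, tokens, d_model)


class CrossModalFusion(nn.Module):
    """Fuses modality tokens with multi-head attention plus residual
    connections and layer normalization, then pools for classification."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, n_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.cls_head = nn.Linear(d_model, n_classes)
        self.recon_head = nn.Linear(d_model, d_model)  # self-supervised reconstruction

    def forward(self, tokens: torch.Tensor, pad_mask=None):
        # pad_mask marks missing-modality tokens (True = ignore),
        # giving one simple way to handle incomplete inputs.
        attn_out, attn_weights = self.attn(tokens, tokens, tokens,
                                           key_padding_mask=pad_mask)
        fused = self.norm(tokens + attn_out)   # residual connection + normalization
        pooled = fused.mean(dim=1)             # mean pooling over modality tokens
        return self.cls_head(pooled), self.recon_head(fused), attn_weights


def multitask_loss(logits, labels, recon, targets, z_a, z_b,
                   alpha=1.0, beta=0.5, gamma=0.5, tau=0.1):
    """Supervised classification + reconstruction + InfoNCE-style contrastive alignment."""
    ce = F.cross_entropy(logits, labels)
    mse = F.mse_loss(recon, targets)
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    sim = z_a @ z_b.t() / tau  # similarity between paired views of two modalities
    contrastive = F.cross_entropy(sim, torch.arange(sim.size(0), device=sim.device))
    return alpha * ce + beta * mse + gamma * contrastive


# Usage sketch: encode two hypothetical modalities and fuse their token sequences.
ehr = ModalityEncoder(input_dim=64)(torch.randn(4, 10, 64))    # 10 EHR-sequence tokens
img = ModalityEncoder(input_dim=128)(torch.randn(4, 16, 128))  # 16 imaging-patch tokens
logits, recon, attn = CrossModalFusion()(torch.cat([ehr, img], dim=1))
```

Mean pooling and an InfoNCE-style alignment term are only one plausible instantiation; the abstract does not specify the exact fusion pooling, reconstruction target, or contrastive formulation used in the paper.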