Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models such as Contrastive Language-Image Pre-Training (CLIP) perform well on natural-image tasks, they struggle in medical applications, particularly in cross-modal retrieval of ophthalmological images. To address this gap in medical image-text alignment, we propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture. Our approach employs a separate encoder for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained with multiple objectives: contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading under the ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) show significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval, with Recall@1 of 99.94% compared to 1.29% for fine-tuned CLIP, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset confirms strong generalizability, with Recall@1 of 93.95% versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, yielding both superior retrieval capabilities and robust diagnostic performance.
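To make the fusion and training objectives concrete, the following is a minimal sketch of the overall scheme: three modality projections combined through a joint transformer with learned modality-specific embeddings, trained with pairwise contrastive losses and a classification head. It is not the authors' implementation; the real encoders (ViT-B/16, Bio-ClinicalBERT, and the tabular MLP) are replaced by small placeholder modules so the example runs standalone, and all layer sizes, loss weights, and temperature values are illustrative assumptions.

```python
# Hedged sketch of a three-modality joint embedding with contrastive and
# classification objectives. Placeholder encoders stand in for the paper's
# ViT-B/16, Bio-ClinicalBERT, and tabular MLP; dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared embedding width (assumed)

class JointEmbeddingModel(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, tab_dim=32, n_classes=6):
        super().__init__()
        # Placeholder projections standing in for the image, text, and tabular encoders.
        self.img_proj = nn.Linear(img_dim, D)
        self.txt_proj = nn.Linear(txt_dim, D)
        self.tab_proj = nn.Sequential(nn.Linear(tab_dim, D), nn.ReLU(), nn.Linear(D, D))
        # Learned modality-specific embeddings added to each token before fusion.
        self.modality_emb = nn.Parameter(torch.zeros(3, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_head = nn.Linear(D, n_classes)  # e.g. DR severity grades

    def forward(self, img_feat, txt_feat, tab_feat):
        tokens = torch.stack([
            self.img_proj(img_feat),
            self.txt_proj(txt_feat),
            self.tab_proj(tab_feat),
        ], dim=1) + self.modality_emb          # (B, 3, D)
        fused = self.fusion(tokens)            # joint transformer fusion
        img_z, txt_z, tab_z = fused.unbind(dim=1)
        logits = self.cls_head(fused.mean(dim=1))
        return img_z, txt_z, tab_z, logits

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between two modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy training step with pre-extracted features (batch of 8).
model = JointEmbeddingModel()
img = torch.randn(8, 768); txt = torch.randn(8, 768); tab = torch.randn(8, 32)
labels = torch.randint(0, 6, (8,))
img_z, txt_z, tab_z, logits = model(img, txt, tab)
loss = (info_nce(img_z, txt_z) + info_nce(img_z, tab_z) + info_nce(txt_z, tab_z)
        + F.cross_entropy(logits, labels))   # reconstruction terms omitted here
loss.backward()
```

In the full framework the contrastive terms over all modality pairs are combined with image and text reconstruction losses and with classification losses for both the ICDR and SDRG grading schemes; the sketch above keeps only the pairwise contrastive and single-head classification terms for brevity.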