In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports produced in daily radiological practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel triplet extraction module to extract medically relevant information, avoiding unnecessary complexity from language grammar and strengthening the supervision signals; Second, we propose a novel triplet encoding module with entity translation by querying a knowledge base, to exploit the rich domain knowledge of the medical field and implicitly build relationships between medical entities in the language embedding space; Third, we propose a Transformer-based fusion model that spatially aligns entity descriptions with visual signals at the image-patch level, enabling grounded medical diagnosis; Fourth, we conduct thorough experiments to validate the effectiveness of our architecture, and evaluate on numerous public benchmarks, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model demonstrates strong performance on disease classification and grounding compared with prior methods.
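To make the third contribution concrete, the sketch below shows one plausible form of a Transformer-based fusion module: entity (text) embeddings act as queries that cross-attend over image-patch features, so the attention map itself provides the patch-level spatial alignment used for grounding. This is a minimal illustrative sketch, not the paper's actual implementation; the class name `EntityPatchFusion`, the dimensions, and the single-layer design are all assumptions.

```python
import torch
import torch.nn as nn

class EntityPatchFusion(nn.Module):
    """Hypothetical sketch: entity embeddings cross-attend over
    image-patch features; the attention weights give a per-entity
    spatial alignment map, and the fused features drive per-entity
    presence classification."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, 1)  # one presence logit per entity

    def forward(self, entity_emb, patch_feats):
        # entity_emb: (B, E, dim) text-side entity embeddings (queries)
        # patch_feats: (B, P, dim) visual patch features (keys/values)
        fused, attn = self.cross_attn(entity_emb, patch_feats, patch_feats)
        fused = self.norm(fused + entity_emb)          # residual + norm
        logits = self.classifier(fused).squeeze(-1)    # (B, E)
        return logits, attn                            # attn: (B, E, P)

# Toy usage: 2 images, 4 entities, 14x14 = 196 patches, 256-dim features.
model = EntityPatchFusion(dim=256)
logits, attn = model(torch.randn(2, 4, 256), torch.randn(2, 196, 256))
```

The per-entity attention map `attn` can be reshaped to the patch grid (e.g., 14×14) and upsampled to visualize which image regions support each entity, which is the grounding behaviour the abstract describes.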