Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods comprise three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, with few studies considering the importance of medical domain expert knowledge or explicitly exploiting such knowledge to facilitate Med-VLP. Although knowledge-enhanced vision-and-language pre-training (VLP) methods exist in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers) that are unavailable in the medical domain. In this paper, we propose a systematic and effective approach that enhances Med-VLP with structured medical knowledge from three perspectives. First, since knowledge can be regarded as the intermediate medium between vision and language, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model so that the model can perform reasoning using knowledge as a supplement to the input image and text. Third, we guide the model to focus on the most critical information in images and texts by designing knowledge-induced pretext tasks. To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results illustrate the effectiveness of our approach, which achieves state-of-the-art performance on all downstream tasks. Further analyses explore the effects of different components of our approach and various pre-training settings.
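The first perspective above treats knowledge embeddings as a shared anchor between the two modalities. The following is a minimal, illustrative sketch of that idea (not the paper's actual implementation): an InfoNCE-style contrastive loss that pulls a matched vision embedding and language embedding toward the same knowledge embedding, so the two uni-modal spaces become aligned through the knowledge space. All function names, the toy dimensions, and the temperature value are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def knowledge_anchored_alignment_loss(v, t, k, temperature=0.07):
    """Hypothetical knowledge-anchored alignment loss.

    v, t, k: (B, d) batches of vision, language, and knowledge embeddings,
    where row i of each matrix belongs to the same example. Each modality is
    contrasted against the knowledge embeddings: the matched knowledge row is
    the positive, all other rows in the batch are negatives.
    """
    v, t, k = l2_normalize(v), l2_normalize(t), l2_normalize(k)

    def info_nce(queries, anchors):
        logits = queries @ anchors.T / temperature        # (B, B) similarities
        # Log-softmax over each row; the diagonal entry is the positive pair.
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.diag(log_prob).mean()

    # Symmetric objective: vision->knowledge and language->knowledge terms.
    return 0.5 * (info_nce(v, k) + info_nce(t, k))

# Toy check: embeddings near their knowledge anchors should score a lower
# loss than unrelated random embeddings.
rng = np.random.default_rng(0)
B, d = 4, 8
k = rng.normal(size=(B, d))
v_close = k + 0.01 * rng.normal(size=(B, d))
t_close = k + 0.01 * rng.normal(size=(B, d))
v_rand, t_rand = rng.normal(size=(B, d)), rng.normal(size=(B, d))

loss_close = knowledge_anchored_alignment_loss(v_close, t_close, k)
loss_rand = knowledge_anchored_alignment_loss(v_rand, t_rand, k)
```

In practice such a loss would be computed on encoder outputs during pre-training; this sketch only shows how knowledge can serve as the intermediate medium that the abstract describes.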