Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs to make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art results on all downstream tasks. In addition, we conduct further analyses to verify the effectiveness of the different components of our approach and various pre-training settings. The source code is available at~\url{https://github.com/zhjohnchan/M3AE}.
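To make the three design choices above concrete, the following is a minimal PyTorch sketch of a multi-modal masked autoencoder, not the authors' implementation: the class name, layer sizes, and the 75\%/15\% masking ratios are illustrative assumptions, and for brevity both decoders read the final encoder output rather than features from different layers as the paper describes.

\begin{verbatim}
# Minimal sketch (assumed names/sizes, not the released M3AE code).
import torch
import torch.nn as nn

class M3AESketch(nn.Module):
    def __init__(self, dim=256, vocab_size=30522, patch_dim=16 * 16 * 3,
                 img_mask_ratio=0.75, txt_mask_ratio=0.15):
        super().__init__()
        # Design 1: a much larger masking ratio for images than for text.
        self.img_mask_ratio = img_mask_ratio
        self.txt_mask_ratio = txt_mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.token_embed = nn.Embedding(vocab_size, dim)
        # Shared multi-modal encoder over concatenated patch/token sequences.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=6)
        # Design 3: Transformer decoder for vision, MLP decoder for language.
        self.vision_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2)
        self.pixel_head = nn.Linear(dim, patch_dim)
        self.language_decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab_size))

    def random_mask(self, x, ratio):
        # Zero out a random subset of positions (a stand-in for mask tokens)
        # and return the boolean mask used for the reconstruction loss.
        mask = torch.rand(x.shape[:2], device=x.device) < ratio
        return x * (~mask).unsqueeze(-1), mask

    def forward(self, patches, token_ids):
        v = self.patch_embed(patches)          # (B, N_patches, dim)
        t = self.token_embed(token_ids)        # (B, N_tokens, dim)
        v, v_mask = self.random_mask(v, self.img_mask_ratio)
        t, t_mask = self.random_mask(t, self.txt_mask_ratio)
        h = self.encoder(torch.cat([v, t], dim=1))
        h_v, h_t = h[:, :v.size(1)], h[:, v.size(1):]
        pixel_pred = self.pixel_head(self.vision_decoder(h_v))
        token_pred = self.language_decoder(h_t)
        # Losses would compare pixel_pred/token_pred to the original patches
        # and token ids at the masked positions only.
        return pixel_pred, token_pred, v_mask, t_mask
\end{verbatim}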