Vision-and-language (V&L) models take images and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve performance on downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied to the medical domain (e.g., X-ray images and clinical notes) due to the domain gap. In this paper, we investigate the challenges of applying pre-trained V&L models to medical applications. In particular, we identify that the visual representations used in general-domain V&L models are not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT, which better captures the associations between the two modalities. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, which is 1.62% higher than the state of the art (SOTA), while being trained on a 9 times smaller dataset.
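To make the architecture concrete, the sketch below shows how externally computed visual features (in BERTHop, PixelHop++ features; here replaced by a placeholder tensor, since PixelHop++ has no standard library API) can be paired with a clinical note and fed into a VisualBERT-style transformer via the Hugging Face `VisualBertModel` interface. The checkpoint name, feature count, and feature dimension are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch (not the authors' code): combine text with externally
# extracted visual features in a VisualBERT-style transformer.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")  # assumed checkpoint

# Text modality: a toy clinical note.
text = "The cardiac silhouette is mildly enlarged; no focal consolidation."
inputs = tokenizer(text, return_tensors="pt")

# Visual modality: placeholder standing in for PixelHop++ features of one
# chest X-ray. BERTHop's actual feature count/dimension may differ; 2048
# matches this checkpoint's visual embedding size.
num_visual_tokens, visual_dim = 36, 2048
visual_embeds = torch.randn(1, num_visual_tokens, visual_dim)
visual_attention_mask = torch.ones(1, num_visual_tokens, dtype=torch.long)
visual_token_type_ids = torch.ones(1, num_visual_tokens, dtype=torch.long)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)

# The pooled [CLS] representation could feed a multi-label disease classifier.
pooled = outputs.pooler_output  # shape: (1, 768)
print(pooled.shape)
```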