Recent advances in vision-language pre-training have demonstrated remarkable performance on diverse vision-language tasks, shedding light on the long-standing problem of comprehensively understanding both visual and textual concepts in artificial intelligence research. However, vision-language pre-training has seen limited success in the medical domain, as current vision-language models and learning strategies designed for photographic images and captions are not well suited to medical data, which are typically limited in both quantity and diversity. To address this, we present medical X-VL, a model tailored for efficient vision-language pre-training that applies cross-attention symmetrically within the common feature space of radiological images and reports. We experimentally demonstrate that the pre-trained medical X-VL model outperforms current state-of-the-art models on various medical vision-language tasks. We further demonstrate novel clinical uses, including the diagnosis of newly emerging diseases and the detection of human error, suggesting the model's potential for widespread applicability across medical applications.
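As a minimal sketch of the symmetric cross-attention idea mentioned above (not the authors' released implementation), the snippet below projects image and report tokens into a shared feature space and lets each modality attend to the other with one cross-attention module per direction; all dimensions, class names, and the use of PyTorch's MultiheadAttention are illustrative assumptions.

```python
# Hypothetical sketch of symmetric cross-attention in a shared feature space.
# Dimensions and module names are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class SymmetricCrossAttention(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, shared_dim=512, num_heads=8):
        super().__init__()
        # Project both modalities into a common feature space.
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        # One cross-attention module per direction (image->text, text->image).
        self.img_to_txt = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        img = self.img_proj(img_tokens)   # (B, N_img, shared_dim)
        txt = self.txt_proj(txt_tokens)   # (B, N_txt, shared_dim)
        # Image features attend to report tokens ...
        img_attended, _ = self.img_to_txt(query=img, key=txt, value=txt)
        # ... and report features attend to image tokens, symmetrically.
        txt_attended, _ = self.txt_to_img(query=txt, key=img, value=img)
        return img_attended, txt_attended


if __name__ == "__main__":
    # Toy usage with random tensors standing in for image patch embeddings
    # and report token embeddings.
    model = SymmetricCrossAttention()
    img_feats = torch.randn(2, 196, 768)
    txt_feats = torch.randn(2, 64, 768)
    fused_img, fused_txt = model(img_feats, txt_feats)
    print(fused_img.shape, fused_txt.shape)
```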