Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting this data at scale is essential for improving clinical care and accelerating clinical research. Biomedical text, with its complex semantics, poses additional challenges in vision--language modelling compared to the general domain, and previous work has used insufficiently adapted models that lack domain-specific language understanding. In this paper, we show that principled textual semantic modelling can substantially improve contrastive learning in self-supervised vision--language processing. We release a language model that achieves state-of-the-art results in radiology natural language inference through its improved vocabulary and a novel language pretraining objective that leverages the semantics and discourse characteristics of radiology reports. Further, we propose a self-supervised joint vision--language approach with a focus on better text modelling. It establishes new state-of-the-art results on a wide range of publicly available benchmarks, in part by leveraging our new domain-specific language model. We release a new dataset with locally aligned phrase grounding annotations by radiologists to facilitate the study of complex semantic modelling in biomedical vision--language processing. A broad evaluation, including on this new dataset, shows that our contrastive learning approach, aided by textual semantic modelling, outperforms prior methods in segmentation tasks, despite only using a global-alignment objective.
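The abstract refers to a global-alignment contrastive objective between images and reports. As a point of reference, the sketch below shows one standard formulation of such an objective: a symmetric InfoNCE loss over paired image and text embeddings, where matching pairs in a batch are positives and all other pairs are negatives. The function name, fixed temperature, and CLIP-style normalised projections are assumptions for illustration; the exact loss and hyperparameters used in the paper may differ.

```python
import torch
import torch.nn.functional as F


def global_alignment_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    image_emb, text_emb: (batch, dim) projections from the image and text
    encoders; row i of each tensor comes from the same study, so matching
    rows are positive pairs and all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Note that this objective aligns only the global image and report representations; any local (patch-level or phrase-level) correspondence evaluated on the segmentation benchmarks would emerge without an explicit local alignment term.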