Deep learning has shown great potential in assisting radiologists with reading chest X-ray (CXR) images, but its reliance on expensive annotations to improve performance prevents widespread clinical application. Vision-language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging routinely generated reports for radiographs, which exist in large quantities and in naturally paired form (image-text pairs). In addition, locality-aware extensions of VLP have been proposed to meet the need for accurate localization of abnormalities in computer-aided diagnosis (CAD) for CXR. However, we find that the formulations proposed in the locality-aware VLP literature actually lose the spatial relationships required for downstream localization tasks. We therefore propose Empowering Locality of VLP with Intra-modal Similarity (ELVIS), a VLP method aware of intra-modal locality, to better preserve locality within radiographs and reports, which in turn enhances the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines on multiple segmentation tasks and on the MS-CXR phrase grounding task. Qualitatively, ELVIS focuses on the regions of interest described in the report text better than prior approaches do, allowing for enhanced interpretability.
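To make the notion of "preserving intra-modal locality" concrete, the sketch below illustrates one plausible form of an intra-modal similarity constraint: the pairwise similarity structure among local (patch-level) embeddings is kept consistent across two views of the same radiograph, so that spatial relationships survive pre-training. This is a minimal illustration under our own assumptions, not the paper's actual objective; the function name, loss form, and input shapes are all hypothetical.

```python
import torch
import torch.nn.functional as F

def intra_modal_similarity_loss(local_feats_a: torch.Tensor,
                                local_feats_b: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of an intra-modal locality constraint.

    local_feats_a, local_feats_b: (batch, num_locals, dim) local embeddings
    of two views (e.g., augmentations) of the same radiograph.
    """
    # L2-normalize so inner products are cosine similarities.
    a = F.normalize(local_feats_a, dim=-1)
    b = F.normalize(local_feats_b, dim=-1)
    # Pairwise similarity among local embeddings within each sample: (batch, L, L).
    sim_a = torch.bmm(a, a.transpose(1, 2))
    sim_b = torch.bmm(b, b.transpose(1, 2))
    # Penalize divergence between the two intra-modal similarity structures.
    return F.mse_loss(sim_a, sim_b)
```

In this reading, the loss leaves the global image-text alignment objective untouched and only regularizes how local embeddings relate to one another within a modality, which is one way a model could retain the spatial layout needed for segmentation and phrase grounding.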