Contrastive learning has proven effective for pre-training image models on unlabeled data, with promising results for tasks such as medical image classification. Using paired text (such as radiological reports) during pre-training improves the results even further. Still, most existing methods target image classification as the downstream task and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), to the best of our knowledge the first text-supervised pre-training method that targets localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image-region and report-sentence representations. We evaluate LoVT and commonly used pre-training methods on an evaluation framework of 18 localized tasks on chest X-rays from five public datasets. LoVT performs best on 10 of the 18 studied tasks, making it the method of choice for localized tasks.
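To make the two objectives concrete, the following is a minimal PyTorch sketch of how an instance-level image-report loss can be combined with a local loss over region and sentence representations. The function names, the soft-attention alignment, and the loss weighting `lambda_local` are illustrative assumptions for exposition only, not LoVT's exact formulation (the paper's local objective is more involved).

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def global_plus_local_loss(img_global, txt_global, img_regions, txt_sentences,
                           lambda_local=1.0):
    """Illustrative combination of an instance-level image-report loss with a
    local loss aligning image-region and sentence representations.

    img_global:    (N, D)     one embedding per image
    txt_global:    (N, D)     one embedding per report
    img_regions:   (N, R, D)  R region embeddings per image
    txt_sentences: (N, S, D)  S sentence embeddings per report
    """
    # Instance-level contrast: match each image with its paired report.
    global_loss = info_nce(img_global, txt_global)

    # Local alignment: within each pair, attend from regions to sentences
    # (a simplified stand-in for cross-modal alignment; an assumption here).
    regions = F.normalize(img_regions, dim=-1)
    sentences = F.normalize(txt_sentences, dim=-1)
    sim = torch.einsum('nrd,nsd->nrs', regions, sentences)   # (N, R, S)
    attn = sim.softmax(dim=-1)                               # soft match per region
    aligned = torch.einsum('nrs,nsd->nrd', attn, sentences)  # (N, R, D)
    # Pull each region toward its attended text representation.
    local_loss = (1 - F.cosine_similarity(regions, aligned, dim=-1)).mean()

    return global_loss + lambda_local * local_loss
```

The intuition behind the split is that the global term teaches the model which report belongs to which image, while the local term forces individual image regions to carry information matching specific sentences, which is what localized downstream tasks such as detection and segmentation benefit from.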