Contrastive learning has proven effective for pre-training image models on unlabeled data, with promising results for tasks such as medical image classification. Using paired text and images (such as radiological reports and images) during pre-training improved the results even further. Still, most existing methods target image classification as the downstream task and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), to the best of our knowledge the first text-supervised pre-training method that targets localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image region and report sentence representations. We evaluate LoVT and commonly used pre-training methods on a novel evaluation framework consisting of 18 localized tasks on chest X-rays from five public datasets. While there is no single best method, LoVT performs best on 11 of the 18 studied tasks, making it the method of choice for localized tasks.
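To make the two objectives concrete, the following is a minimal PyTorch sketch of how an instance-level image-report contrastive loss can be combined with a local contrastive loss over region and sentence representations. The symmetric InfoNCE formulation, the temperature value, the cross-attention pooling of sentences into per-region targets, and all tensor shapes are illustrative assumptions for this sketch, not the LoVT implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired, L2-normalized embeddings.

    a, b: (N, d) tensors where row i of `a` is the positive pair of row i of `b`.
    """
    logits = a @ b.t() / temperature  # (N, N) similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Contrast in both directions (a->b and b->a) and average.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy shapes (assumed): batch of 8 studies, 49 image regions,
# 12 report sentences, 128-dimensional embeddings.
N, R, S, d = 8, 49, 12, 128
img_global = F.normalize(torch.randn(N, d), dim=-1)        # per-image representation
txt_global = F.normalize(torch.randn(N, d), dim=-1)        # per-report representation
img_regions = F.normalize(torch.randn(N, R, d), dim=-1)    # region representations
txt_sentences = F.normalize(torch.randn(N, S, d), dim=-1)  # sentence representations

# Instance-level objective: align whole images with their reports.
loss_global = info_nce(img_global, txt_global)

# Local objective (one possible variant): pool sentences into a target
# per region via cross-attention, then contrast regions against targets.
attn = torch.softmax(img_regions @ txt_sentences.transpose(1, 2) / 0.1, dim=-1)  # (N, R, S)
region_targets = F.normalize(attn @ txt_sentences, dim=-1)                       # (N, R, d)
loss_local = info_nce(img_regions.reshape(N * R, d), region_targets.reshape(N * R, d))

loss = loss_global + loss_local
print(loss.item())
```

The point of the combination is that the global term ties each image to its report as a whole, while the local term forces individual regions to match the sentences that describe them, which is what localized downstream tasks such as detection and segmentation benefit from.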