Contrastive pretraining on parallel image-text data has achieved great success in vision-language processing (VLP), as exemplified by CLIP and related methods. However, prior work has tended to focus on general web domains. Biomedical images and text differ substantially from general web content, yet publicly available biomedical datasets are small and skewed toward chest X-rays, severely limiting progress. In this paper, we conducted by far the largest study on biomedical VLP, using 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. Our dataset (PMC-15M) is two orders of magnitude larger than existing biomedical image-text datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. We show that the standard CLIP method is suboptimal for the biomedical domain, and propose BiomedCLIP with domain-specific adaptations tailored to biomedical VLP. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks ranging from retrieval to classification to visual question answering (VQA). BiomedCLIP established a new state of the art on a wide range of standard datasets, substantially outperforming prior VLP approaches. Surprisingly, BiomedCLIP even outperformed radiology-specific state-of-the-art models such as BioViL on radiology-specific tasks such as RSNA pneumonia detection, highlighting the utility of large-scale pretraining across all biomedical image types. We will release our models at https://aka.ms/biomedclip to facilitate future research in biomedical VLP.
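For context, the sketch below illustrates the symmetric contrastive (InfoNCE) objective that CLIP-style pretraining optimizes over a batch of figure-caption pairs. It is a minimal PyTorch illustration, not the released BiomedCLIP implementation; the function name, variable names, and default temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matched pairs share the same row index; every other row in the
    batch serves as an in-batch negative.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The positive pair for row i sits on the diagonal (column i).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Each row competes against all other captions in the batch (and vice versa), which is why large batch sizes and large, diverse datasets such as PMC-15M matter for this objective.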