生物医学图象分离:为自监督学习挖掘大型数据集 (Compound Figure Separation of Biomedical Images: Mining Large Datasets for Self-supervised Learning)

Tianyuan Yao,Chang Qu,Jun Long,Quan Liu,Ruining Deng,Yuanhan Tian,Jiachen Xu,Aadarsh Jha,Zuhayr Asad,Shunxing Bao,Mengyang Zhao,Agnes B. Fogo,Bennett A. Landman,Haichun Yang,Catie Chang,Yuankai Huo

from arxiv, Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://www.melba-journal.org/papers/2022:025.html. arXiv admin note: substantial text overlap with arXiv:2107.08650

With the rapid development of self-supervised learning (e.g., contrastive learning), the importance of having large-scale images (even without annotations) for training a more generalizable AI model has been widely recognized in medical image analysis. However, collecting large-scale task-specific unannotated data at scale can be challenging for individual labs. Existing online resources, such as digital books, publications, and search engines, provide a new resource for obtaining large-scale images. However, published images in healthcare (e.g., radiology and pathology) consist of a considerable amount of compound figures with subplots. In order to extract and separate compound figures into usable individual images for downstream learning, we propose a simple compound figure separation (SimCFS) framework without using the traditionally required detection bounding box annotations, with a new loss function and a hard case simulation. Our technical contribution is four-fold: (1) we introduce a simulation-based training framework that minimizes the need for resource extensive bounding box annotations; (2) we propose a new side loss that is optimized for compound figure separation; (3) we propose an intra-class image augmentation method to simulate hard cases; and (4) to the best of our knowledge, this is the first study that evaluates the efficacy of leveraging self-supervised learning with compound image separation. From the results, the proposed SimCFS achieved state-of-the-art performance on the ImageCLEF 2016 Compound Figure Separation Database. The pretrained self-supervised learning model using large-scale mined figures improved the accuracy of downstream image classification tasks with a contrastive learning algorithm. The source code of SimCFS is made publicly available at https://github.com/hrlblab/ImageSeperation.

翻译：随着自我监督学习(例如对比学习)的迅速发展,在医学图像分析中广泛认识到对培训更普遍的AI模型使用大型图像(即使没有说明)的重要性。然而,在规模上收集大规模任务专用的无附加说明数据对于单个实验室来说可能具有挑战性。现有的在线资源,如数字书籍、出版物和搜索引擎,为获取大规模图像提供了新的资源。然而,在医疗保健(例如放射学和病理学)中公布的图像包括大量带有子笔的复合数字。为了将复合数字提取和分离成可用于下游学习的大型个人图像,我们提议了一个简单的复合数字分离框架(SIMCFS),而不使用传统上要求的密封框说明,同时使用新的损失功能和硬案例模拟。我们的技术贡献是:(1) 我们引入一个基于模拟的培训框架,以最大限度地减少对资源模型绑定的框说明的需求;(2) 我们提出一个新的侧面数据损失源,以优化复合图分解;(3) 我们提议从内部的图像升级方法到模拟硬图案例;(4) 拟议的SlimCSimlial的自我分析结果的自我分析结果的升级,是最佳分析。