High-performance deep learning methods typically rely on large annotated training datasets, which are difficult to obtain in many clinical applications because medical image labeling is costly. Existing data assessment methods commonly require the labels to be known in advance, which makes them unsuitable for our goal of 'knowing which data to label.' To this end, we propose a novel and efficient data assessment strategy, the EXponentiAl Marginal sINgular valuE (EXAMINE) score, which ranks the quality of unlabeled medical image data based on latent representations extracted via Self-supervised Learning (SSL) networks. Motivated by the theoretical implications of the SSL embedding space, we use a Masked Autoencoder for feature extraction. We then evaluate the quality of each data point by the marginal change in the largest singular value when that point is excluded from the dataset. We conduct extensive experiments on a pathology dataset. Our results indicate that the proposed method is effective and efficient at selecting the most valuable data to label.
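The leave-one-out idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the exact EXAMINE formula, so the code assumes the singular value is taken over the matrix of per-image embeddings and that the score is the exponential of the drop in the largest singular value when one row is removed. The function names and the convention that a higher score means a more valuable sample are hypothetical, chosen for illustration only.

```python
import numpy as np


def largest_singular_value(features: np.ndarray) -> float:
    """Largest singular value (spectral norm) of an (n_samples, dim) matrix."""
    return float(np.linalg.norm(features, ord=2))


def marginal_singular_value_scores(features: np.ndarray) -> np.ndarray:
    """Score each row by the leave-one-out change in the largest singular value.

    ASSUMPTION (not from the paper):
        score_i = exp(sigma_max(X) - sigma_max(X with row i removed))
    """
    n_samples = features.shape[0]
    if n_samples < 2:
        raise ValueError("need at least two samples for a leave-one-out score")

    sigma_full = largest_singular_value(features)
    scores = np.empty(n_samples)
    for i in range(n_samples):
        # Drop sample i and measure how much the top singular value shrinks.
        reduced = np.delete(features, i, axis=0)
        scores[i] = np.exp(sigma_full - largest_singular_value(reduced))
    return scores


def rank_for_labeling(features: np.ndarray) -> np.ndarray:
    """Indices sorted from highest to lowest score (assumed: higher = label first)."""
    return np.argsort(-marginal_singular_value_scores(features))
```

In practice `features` would hold one embedding per unlabeled image, for example the output of a pretrained Masked Autoencoder encoder. This naive loop recomputes a spectral norm for every sample, so it is meant to show the definition, not to be efficient on large datasets.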