Medical image datasets can contain a large number of images representing patients with different health conditions and varying disease severity. When dealing with raw, unlabeled image datasets, the sheer number of samples often makes it hard for experts and non-experts alike to understand the variety of images present. Supervised learning methods rely on labeled images, which requires considerable effort from medical experts, who must first understand the communities of images present in the data and then label them. Here, we propose an algorithm to facilitate the automatic identification of communities in medical image datasets. We further demonstrate that such analysis can be insightful in a supervised setting, when the images are already labeled. Such insights are useful because health and disease severity can be considered a continuous spectrum, and within each class there are usually finer communities worthy of investigation, especially when they have similarities to communities in other classes. In our approach, we use wavelet decomposition of images in tandem with spectral methods. We show that the eigenvalues of a graph Laplacian can reveal the number of notable communities in an image dataset. Moreover, analyzing these similarities may be used to infer a spectrum representing the severity of the disease. In our experiments, we use a dataset of images labeled with different conditions for COVID patients. We detect 25 communities in the dataset and observe that only 6 of those communities contain patients with pneumonia. We also investigate the contents of a colorectal cancer histology dataset.
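The claim that Laplacian eigenvalues can reveal the number of communities can be illustrated with a minimal sketch. It is not the algorithm proposed here, only the standard eigengap heuristic under stated assumptions: each image is already summarized by a feature vector (for example, statistics of its wavelet coefficients), similarity is a Gaussian kernel on Euclidean distance, and the Laplacian is the symmetric normalized one. The function name `estimate_num_communities` and the parameters `sigma` and `max_k` are hypothetical, chosen for illustration.

```python
import numpy as np


def estimate_num_communities(features, sigma=1.0, max_k=10):
    """Estimate the number of communities with the eigengap heuristic.

    features: (n_samples, n_features) array, one row per image.
    Returns (k, eigvals), where eigvals are the Laplacian eigenvalues
    in ascending order.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]

    # Pairwise squared Euclidean distances between feature vectors.
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
    d2 = np.maximum(d2, 0.0)  # guard against small negative round-off

    # Gaussian similarity graph (an assumed choice of kernel).
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Eigenvalues of a symmetric matrix, returned in ascending order.
    eigvals = np.linalg.eigvalsh(L)

    # The largest gap among the smallest eigenvalues marks the estimate:
    # k near-zero eigenvalues followed by a jump suggests k communities.
    gaps = np.diff(eigvals[: max_k + 1])
    k = int(np.argmax(gaps)) + 1
    return k, eigvals


# Synthetic stand-in for image features: three well-separated groups.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(8)[:3]
features = np.vstack([c + 0.3 * rng.standard_normal((30, 8)) for c in centers])
k_est, eigvals = estimate_num_communities(features)
print("estimated communities:", k_est)
```

On real images the gap is rarely this clean, so the estimate depends on the choice of `sigma` and on how the feature vectors are built.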