Training segmentation models for medical images remains challenging due to the limited availability and high acquisition cost of data annotations. The Segment Anything Model (SAM) is a foundation model trained on over 1 billion annotations, predominantly for natural images, and is designed to segment user-defined objects of interest interactively. Despite its impressive performance on natural images, it is unclear how well the model transfers to medical image domains. Here, we perform an extensive evaluation of SAM's ability to segment medical images on a collection of 11 medical imaging datasets spanning various modalities and anatomies. In our experiments, we generated point prompts using a standard method that simulates interactive segmentation. Experimental results show that SAM's performance with a single prompt varies widely across tasks and datasets, ranging from an IoU of 0.1135 on a spine MRI dataset to 0.8650 on a hip X-ray dataset. Performance is high for tasks involving well-circumscribed objects with unambiguous prompts but poorer in many other scenarios, such as tumor segmentation. When multiple prompts are provided, overall performance improves only slightly, though more so on datasets where the object is not contiguous. An additional comparison to RITM showed that SAM performs much better with a single prompt, while the two methods perform similarly once more prompts are provided. We conclude that SAM shows impressive performance on some datasets given the zero-shot learning setup, but poor to moderate performance on many others. While SAM as a model and as a learning paradigm may prove impactful in the medical imaging domain, extensive research is needed to identify the proper ways of adapting it to this domain.
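The abstract refers to a "standard method that simulates interactive segmentation" without detailing it; for concreteness, the sketch below illustrates one widely used variant of such a protocol (the RITM-style rule of placing each click at the interior point of the largest error region, farthest from its boundary), along with the IoU metric reported above. This is a minimal sketch under that assumption; the function names are illustrative and this is not the paper's actual implementation.

```python
# Minimal sketch of a common interactive-segmentation click simulator and
# the IoU metric. Assumes the RITM-style rule: place each click at the
# interior point of the largest error region, farthest from its boundary.
# Names (next_click, iou) are illustrative, not from the paper's code.
import numpy as np
from scipy import ndimage


def next_click(gt_mask: np.ndarray, pred_mask: np.ndarray):
    """Return (row, col, is_positive) for the next simulated click.

    gt_mask, pred_mask: boolean arrays of the same shape. For the first
    click, pass an all-False pred_mask so the whole object counts as error.
    """
    false_neg = gt_mask & ~pred_mask   # missed object pixels -> positive click
    false_pos = ~gt_mask & pred_mask   # spurious pixels      -> negative click
    # Click inside the larger of the two error regions.
    if false_neg.sum() >= false_pos.sum():
        region, positive = false_neg, True
    else:
        region, positive = false_pos, False
    if not region.any():
        return None  # prediction already matches the ground truth
    # Distance transform: the pixel farthest from the region boundary is
    # the most "central" and least ambiguous place to click.
    dist = ndimage.distance_transform_edt(region)
    r, c = np.unravel_index(np.argmax(dist), dist.shape)
    return int(r), int(c), positive


def iou(gt_mask: np.ndarray, pred_mask: np.ndarray) -> float:
    """Intersection over union, the metric reported in the evaluation."""
    inter = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return inter / union if union else 1.0
```

In a full evaluation loop, the first click would be fed to SAM as a point prompt, the predicted mask compared against the ground truth, and `next_click` called again on the updated prediction to simulate further user interaction.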