Creating a dataset for training supervised machine learning algorithms can be a demanding task. This is especially true for medical image segmentation, since the task usually requires one or more specialists to annotate the images, and creating ground-truth labels for a single image can take up to several hours. In addition, it is paramount that the annotated samples adequately represent the different conditions that might affect the imaged tissue, as well as possible changes in the image acquisition process. This can only be achieved by considering samples that are typical of the dataset together with atypical, or even outlier, samples. We introduce a new sampling methodology for selecting relevant images from a larger unannotated dataset in a way that gives equal consideration to prototypical and atypical samples. The methodology involves generating a uniform grid over a feature space representing the samples, which is then used to randomly draw relevant images. The selected images provide a uniform cover of the original dataset and thus define a heterogeneous set of images that can be annotated and used for training supervised segmentation algorithms. We provide a case example by creating a dataset containing a representative set of blood vessel microscopy images selected from a larger dataset containing thousands of images.
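The grid-based selection described above can be illustrated with a minimal sketch. The abstract does not specify how the grid is built or how images are drawn from it, so everything here is an assumption: the function name `grid_sample`, the parameters `n_bins` and `n_samples`, the min-max scaling of a low-dimensional feature matrix, and the rule of drawing one random sample per occupied cell are all hypothetical choices, not the authors' implementation.

```python
import numpy as np


def grid_sample(features, n_bins=10, n_samples=None, seed=0):
    """Return indices of samples drawn evenly over occupied grid cells.

    Hypothetical sketch: `features` is an (n, d) array with one row per
    image, for example a low-dimensional embedding of image descriptors.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)

    # Rescale every feature to [0, 1] so that all axes share the same grid.
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features would otherwise divide by zero

    # Integer cell coordinates in [0, n_bins - 1] for every sample.
    cells = (n_bins * (features - lo) / span).astype(int)
    cells = np.minimum(cells, n_bins - 1)

    # Assign one id per occupied cell; empty cells are never enumerated.
    _, cell_ids = np.unique(cells, axis=0, return_inverse=True)
    cell_ids = cell_ids.ravel()
    n_cells = int(cell_ids.max()) + 1

    # Choose which occupied cells to draw from (all of them by default).
    if n_samples is None or n_samples > n_cells:
        n_samples = n_cells
    chosen_cells = rng.choice(n_cells, size=n_samples, replace=False)

    # Draw one random sample from each chosen cell, so a dense cluster and
    # an isolated outlier cell are equally likely to contribute an image.
    return np.array(
        [rng.choice(np.flatnonzero(cell_ids == c)) for c in chosen_cells]
    )
```

Because each occupied cell contributes at most one image, densely populated regions of the feature space do not dominate the selection, while isolated cells holding atypical samples are still represented. In this sketch the grid resolution `n_bins` controls the trade-off: a finer grid yields more selected images and a closer cover of the dataset.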