In Computed Tomography, machine learning is often used for automated data processing. However, growing model complexity is accompanied by ever larger datasets, which in turn increases the cost of model training. Unlike most work that mitigates this by advancing model architectures and training algorithms, we focus on the annotation procedure and its effect on model performance. We assume the three main virtues of a good dataset collected for model training to be label quality, diversity, and completeness. We compare the effects of these virtues on model performance using open medical CT datasets and conclude that, early during labeling, quality matters more than diversity; diversity, in turn, matters more than completeness. Based on this conclusion and additional experiments, we propose a labeling procedure for the segmentation of tomographic images that minimizes the annotation effort while maximizing model performance.
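To make the comparison concrete, the sketch below shows one way such an experiment could be set up; it is an illustrative assumption, not the paper's actual protocol. Starting from a fully labeled CT segmentation set, it builds three degraded variants, one per virtue: noisy masks (quality), fewer patients (diversity), and fewer labeled slices per patient (completeness), so that a downstream segmentation model trained on each variant can be compared under a fixed labeling budget. All names and parameters here are hypothetical.

```python
"""Sketch: build dataset variants that degrade one virtue each, so their
effect on a downstream segmentation model can be compared."""
import numpy as np

rng = np.random.default_rng(0)

def degrade_quality(mask: np.ndarray, flip_fraction: float = 0.2) -> np.ndarray:
    """Simulate sloppy annotation by flipping a fraction of mask voxels."""
    noisy = mask.copy()
    flip = rng.random(mask.shape) < flip_fraction
    noisy[flip] = 1 - noisy[flip]
    return noisy

def reduce_diversity(cases: list, keep_patients: int = 5) -> list:
    """Keep all labeled slices, but only from a handful of patients."""
    kept_ids = sorted({c["patient_id"] for c in cases})[:keep_patients]
    return [c for c in cases if c["patient_id"] in kept_ids]

def reduce_completeness(cases: list, keep_fraction: float = 0.3) -> list:
    """Keep every patient, but only a fraction of the labeled slices each."""
    out = []
    for c in cases:
        n = max(1, int(len(c["masks"]) * keep_fraction))
        out.append({**c, "masks": c["masks"][:n]})
    return out

# Toy data standing in for real annotations: 20 patients, 10 labeled 64x64 slices each.
cases = [{"patient_id": p,
          "masks": [rng.integers(0, 2, (64, 64)) for _ in range(10)]}
         for p in range(20)]

variants = {
    "quality":      [{**c, "masks": [degrade_quality(m) for m in c["masks"]]}
                     for c in cases],
    "diversity":    reduce_diversity(cases),
    "completeness": reduce_completeness(cases),
}
for name, variant in variants.items():
    n_slices = sum(len(c["masks"]) for c in variant)
    print(f"{name:13s}: {len(variant):2d} patients, {n_slices:3d} labeled slices")
```

Each variant would then be fed to the same training pipeline and scored on a held-out validation set (e.g. with the Dice coefficient), so that the drop in performance attributable to each degraded virtue can be ranked.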