Is More Data All You Need? A Causal Exploration

Curating a large-scale medical imaging dataset for machine learning applications is both time-consuming and expensive. Balancing the workload between model development, data collection, and annotation is difficult for machine learning practitioners, especially under time constraints. Causal analysis is often used in medicine and economics to gain insight into the effects of actions and policies. In this paper, we explore the effect of dataset interventions on the output of image classification models. Using a causal approach, we investigate how the quantity and type of data added to a dataset affect performance on specific subtasks. The main goal of this paper is to highlight the potential of causal analysis as a tool for resource optimization when developing medical imaging ML applications. We explore this concept with a synthetic dataset and an exemplary use case in Diabetic Retinopathy image analysis.
