Large, annotated datasets are not widely available in medical image analysis due to the prohibitive time, cost, and difficulty of labelling large datasets. Unlabelled datasets are easier to obtain, and in many contexts it is feasible for an expert to provide labels for a small subset of images. This work presents an information-theoretic active learning framework that guides the selection of images from the unlabelled pool to be labelled, based on maximizing the expected information gain (EIG) computed on an evaluation dataset. Experiments are performed on two medical image classification datasets: multi-class diabetic retinopathy disease scale classification and multi-class skin lesion classification. Results indicate that, by adapting EIG to account for class imbalance, our proposed Adapted Expected Information Gain (AEIG) outperforms several popular baselines, including the diversity-based CoreSet and uncertainty-based maximum entropy sampling. Specifically, AEIG achieves approximately 95% of overall performance with only 19% of the training data, while other active learning approaches require around 25%. We show that, through careful design choices, our model can be integrated into existing deep learning classifiers.
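To make the acquisition step concrete, the sketch below illustrates one plausible way a class-weighted, information-theoretic score could be used to pick the next batch of images to label. It is a minimal illustration only: the paper's AEIG is defined via expected information gain on an evaluation dataset, whereas this sketch uses a simpler class-weighted predictive entropy as a stand-in, and all function and parameter names (`adapted_scores`, `select_batch`, `class_counts`) are hypothetical assumptions rather than the authors' implementation.

```python
import numpy as np

def adapted_scores(probs, class_weights, eps=1e-12):
    """Class-weighted acquisition score for each unlabelled sample.

    probs: (N, C) predictive class probabilities from the current classifier
    class_weights: (C,) weights (e.g. inverse class frequencies) so that
        rare classes contribute more to the score, countering class imbalance.
    Returns an (N,) array; higher scores indicate more informative samples.
    """
    # Weighted predictive entropy: each class term is re-scaled before summing.
    return -(class_weights * probs * np.log(probs + eps)).sum(axis=1)

def select_batch(probs, class_counts, batch_size):
    """Select the batch_size unlabelled images with the highest adapted scores."""
    class_weights = 1.0 / (class_counts + 1.0)           # inverse-frequency weighting
    class_weights /= class_weights.sum()                  # normalise to sum to 1
    scores = adapted_scores(probs, class_weights)
    return np.argsort(scores)[::-1][:batch_size]          # indices of top-scoring images
```

In an active learning loop, `select_batch` would be called after each training round on the model's predictions for the unlabelled pool, the returned indices sent to the expert for labelling, and the newly labelled images added to the training set before retraining.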