Data imbalance remains one of the open challenges in contemporary machine learning. It is especially prevalent in medical data, such as histopathological images. Traditional data-level approaches to data imbalance are ill-suited to image data: oversampling methods such as SMOTE and its derivatives tend to create unrealistic synthetic observations, whereas undersampling reduces the amount of available data, which is critical for the successful training of convolutional neural networks. To alleviate the problems associated with over- and undersampling, we propose a novel two-stage resampling methodology: we first apply oversampling in the image space, so that a large amount of data is available for training a convolutional neural network, and then apply undersampling in the feature space to fine-tune the last layers of the network. Experiments conducted on a colorectal cancer image dataset indicate the usefulness of the proposed approach.
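The two-stage idea can be illustrated with a minimal sketch. The abstract does not name the specific oversampling or undersampling algorithms, so plain random resampling is used here purely as a stand-in. The helper names (`random_oversample`, `random_undersample`, `extract_features`), the toy data, and the flattening "feature extractor" are all hypothetical, and the actual network training and fine-tuning steps are omitted.

```python
import numpy as np


def random_oversample(labels, rng):
    """Indices that repeat minority samples until every class matches the largest one."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)


def random_undersample(labels, rng):
    """Indices that keep only as many samples per class as the smallest class has."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.min()
    picked = [
        rng.choice(np.flatnonzero(labels == cls), size=target, replace=False)
        for cls in classes
    ]
    return np.concatenate(picked)


def extract_features(batch):
    """Hypothetical stand-in for the trained network's penultimate-layer output."""
    return batch.reshape(len(batch), -1)


rng = np.random.default_rng(0)
images = rng.random((120, 8, 8, 3))        # toy stand-in for image patches
labels = np.array([0] * 100 + [1] * 20)    # imbalanced toy labels

# Stage 1: oversample in the image space, then train the full network (omitted).
over_idx = random_oversample(labels, rng)
stage1_images, stage1_labels = images[over_idx], labels[over_idx]

# Stage 2: undersample in the feature space, then fine-tune the last layers (omitted).
features = extract_features(images)
under_idx = random_undersample(labels, rng)
stage2_features, stage2_labels = features[under_idx], labels[under_idx]
```

In this sketch the first stage sees a balanced set of 200 images, while the second stage works on a smaller balanced set of 40 original (non-duplicated) feature vectors.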