Breast cancer is among the deadliest diseases worldwide, affecting mostly women. Although traditional detection methods have proven valid for the task, they still commonly yield low accuracy and demand considerable time and effort from professionals. A computer-aided diagnosis (CAD) system capable of providing early detection is therefore highly desirable. In the last decade, machine learning-based techniques have been of paramount importance in this context, since they can extract essential information from data and reason about it. However, such approaches still suffer from imbalanced data, particularly in medical applications, where the number of samples from healthy people is, in general, considerably higher than the number of samples from patients. This paper therefore proposes $\text{O}^2$PF, a data oversampling method based on the unsupervised Optimum-Path Forest algorithm. Experiments conducted in the full oversampling scenario demonstrate the robustness of the model, which is compared against three well-established oversampling methods on three breast cancer datasets and three general-purpose medical datasets.