Topological Data Analysis (TDA) is an emerging field that aims to uncover topological information hidden in a dataset. TDA tools are commonly used to build filters and topological descriptors that improve Machine Learning (ML) methods. This paper proposes an algorithm that applies TDA directly to multi-class classification problems, with no further ML stage, and shows advantages on imbalanced datasets. The proposed algorithm builds a filtered simplicial complex on the dataset. Persistent Homology (PH) then guides the selection of a sub-complex in which each unlabeled point receives the label that wins the majority vote among its labeled neighboring points. We selected eight datasets that differ in dimension, degree of class overlap, and imbalance in the number of samples per class. On average, the proposed TDABC method outperformed KNN and weighted-KNN. It is competitive with the Local SVM and Random Forest baseline classifiers on balanced datasets, and it outperforms all baseline methods when classifying entangled and minority classes.
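The voting step described above can be illustrated with a minimal sketch. It is not the paper's TDABC implementation: it assumes a fixed scale `epsilon` supplied by hand instead of one selected through Persistent Homology, and it votes only over the edges (1-simplices) of a Vietoris-Rips complex at that scale. The names `rips_edges`, `vote_labels`, and `epsilon` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import dist


def rips_edges(points, epsilon):
    """Return the edges (1-simplices) of the Vietoris-Rips complex at scale epsilon."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= epsilon
    ]


def vote_labels(points, labels, epsilon):
    """Label each unlabeled point (label None) by majority vote of its labeled neighbors.

    Assumption: epsilon is fixed by the caller; the PH-guided choice of
    sub-complex from the paper is NOT implemented here.
    Points with no labeled neighbor stay None. Ties go to the label
    encountered first among the neighbors.
    """
    neighbors = {i: [] for i in range(len(points))}
    for i, j in rips_edges(points, epsilon):
        neighbors[i].append(j)
        neighbors[j].append(i)

    result = list(labels)
    for i, label in enumerate(labels):
        if label is not None:
            continue
        votes = Counter(labels[j] for j in neighbors[i] if labels[j] is not None)
        if votes:
            result[i] = votes.most_common(1)[0][0]
    return result


# Two small clusters; the last two points are unlabeled.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (0.5, 0.5), (5.5, 5.5)]
labels = ["a", "a", "a", "b", "b", None, None]
predicted = vote_labels(points, labels, epsilon=1.5)
```

In this toy input each unlabeled point is connected only to labeled points of its own cluster, so the vote is unanimous. A fuller version would presumably vote over higher-dimensional simplices and choose the scale from the persistence intervals, as the abstract describes.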