Topological Data Analysis (TDA) is an emerging field that aims to discover topological information hidden in a dataset. TDA tools have commonly been used to create filters and topological descriptors that improve Machine Learning (ML) methods. This paper proposes an algorithm that applies TDA directly to multi-class classification problems, including imbalanced datasets, without any further ML stage. The proposed algorithm builds a filtered simplicial complex on the dataset. Persistent homology is then used to guide the choice of a sub-complex in which each unlabeled point receives the label with the most votes from its labeled neighboring points. To assess the proposed method, 8 datasets were selected with varying degrees of class entanglement, variability in the samples per class, and dimensionality. On average, the proposed TDABC method outperformed the baseline classifiers (wk-NN and k-NN) on each of the computed metrics, especially when classifying entangled and minority classes.
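The pipeline summarized in the abstract (filtration, persistence-guided choice of a sub-complex, neighbor voting) can be illustrated with a deliberately simplified sketch. It is not the paper's TDABC implementation. It assumes a Vietoris-Rips filtration restricted to its 1-skeleton, uses only 0-dimensional persistence (the scales at which connected components merge), and picks the scale with a hypothetical rule: the midpoint of the largest gap between consecutive merge scales. The function names `h0_death_scales`, `pick_scale`, and `classify`, and the toy data, are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def h0_death_scales(points):
    """Return the scales at which connected components merge.

    In a Rips filtration every component is born at scale 0, so these
    values are the death times of the finite H0 persistence intervals.
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at scale d
    return deaths  # ascending, because edges were sorted


def pick_scale(deaths):
    """Assumed heuristic (not from the paper): midpoint of the largest
    gap between consecutive H0 death scales."""
    gaps = [(b - a, a, b) for a, b in zip(deaths, deaths[1:])]
    _, a, b = max(gaps)
    return (a + b) / 2


def classify(points, labels, scale):
    """Give each unlabeled point (label None) the majority label among
    labeled points joined to it by an edge of length <= scale."""
    predicted = list(labels)
    for i, label in enumerate(labels):
        if label is not None:
            continue
        votes = Counter(
            labels[j]
            for j in range(len(points))
            if j != i
            and labels[j] is not None
            and math.dist(points[i], points[j]) <= scale
        )
        if votes:
            predicted[i] = votes.most_common(1)[0][0]
    return predicted


# Toy data: two well-separated groups, one unlabeled point in each.
points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
labels = ["a", "a", "a", None, "b", "b", "b", None]

deaths = h0_death_scales(points)
scale = pick_scale(deaths)
predicted = classify(points, labels, scale)
print(predicted)
```

On this toy input the chosen scale falls between the within-group and between-group distances, so each unlabeled point is voted on only by its own group. A fuller treatment would work with higher-dimensional simplices and persistence intervals, which this sketch omits.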