A fundamental task in machine learning is visualizing the high-dimensional data sets that arise in high-impact application domains. The task becomes considerably harder when the data are large and imbalanced. In this paper, we use the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm to reduce the dimensionality of an earthquake engineering data set for visualization. Because class imbalance degrades classifier accuracy, we apply the Synthetic Minority Oversampling Technique (SMOTE) to address the imbalance in this data set. We present the results obtained with t-SNE and SMOTE and compare them with baseline approaches from several perspectives. Considering four options and six classification algorithms, we show that applying t-SNE to the imbalanced data and SMOTE to the training set lets neural network classifiers achieve promising results without sacrificing accuracy. The studied scientific data can therefore be mapped into a two-dimensional (2D) space, so that the classifier and its decision surface can be visualized in a 2D plot.
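To make the oversampling step concrete, the following is a minimal sketch of SMOTE-style interpolation applied to 2D training points, such as those a t-SNE embedding would produce. It is an illustration under stated assumptions, not the paper's implementation: the function name `smote_oversample`, the toy coordinates, and the choice of `k` are all hypothetical, and only the training split is oversampled, as the abstract describes.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical 2D training points standing in for t-SNE coordinates.
majority_train = [(x * 0.5, y * 0.5) for x in range(6) for y in range(5)]
minority_train = [(5.0, 5.0), (5.5, 5.2), (6.0, 4.8), (5.2, 6.1), (6.3, 5.7)]

# Oversample only the training split until both classes have equal size.
new_points = smote_oversample(
    minority_train,
    n_synthetic=len(majority_train) - len(minority_train),
    k=3,
)
balanced_minority = minority_train + new_points
```

Because each synthetic point is interpolated between two existing minority samples, the new points stay inside the region the minority class already occupies in the 2D plot. Leaving the test split untouched keeps the evaluation on real observations only.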