With the rapid growth of the information age, massive amounts of data are generated every day. Owing to the large scale and high dimensionality of these data, it is often difficult to make good decisions in practical applications, so an efficient big data analytics method is urgently needed. In feature engineering, feature selection is an important research topic that aims to select "excellent" features from the candidate ones. Feature selection serves several purposes, such as reducing dimensionality, improving model accuracy, and improving computational performance. In many classification tasks, researchers have observed that samples from the same class tend to lie close to each other; therefore, local compactness is of great importance when evaluating a feature. In this manuscript, we propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select the desired features. To demonstrate its efficiency and accuracy, extensive experiments are conducted on several data sets, and the effectiveness and superiority of our method are further revealed on clustering tasks. Performance is measured by several well-known evaluation metrics, while efficiency is reflected by the corresponding running time. The simulation results show that the proposed algorithm is more accurate and efficient than existing algorithms.
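To make the locality idea concrete, the following is a minimal sketch of a compactness-style unsupervised feature score. It assumes a Laplacian-Score-like criterion (k-nearest-neighbor graph with a heat kernel, local variation normalized by global variance) as a stand-in; the abstract does not give the exact CSUFS formula, so the function name and parameters here are illustrative assumptions, not the authors' method.

```python
# Sketch of a locality/compactness-based feature score (assumption: a
# Laplacian-Score-style criterion, NOT necessarily the paper's CSUFS formula).
import numpy as np

def compactness_style_scores(X, k=5, sigma=1.0):
    """Score each feature of X (n_samples, n_features); lower = more locally compact."""
    n, d = X.shape
    # Pairwise squared Euclidean distances between samples.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # k-nearest-neighbor affinity matrix with heat-kernel weights.
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(sq_dists[i])[1:k + 1]   # skip the point itself
        W[i, neighbors] = np.exp(-sq_dists[i, neighbors] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                             # symmetrize the graph
    D = np.diag(W.sum(axis=1))                         # degree matrix
    L = D - W                                          # graph Laplacian
    ones = np.ones(n)
    scores = np.empty(d)
    for j in range(d):
        f = X[:, j]
        # Degree-weighted centering of the feature.
        f_c = f - (f @ D @ ones) / (ones @ D @ ones)
        local_variation = f_c @ L @ f_c                # small if neighbors agree
        global_variance = f_c @ D @ f_c                # overall spread
        scores[j] = local_variation / (global_variance + 1e-12)
    return scores

# Usage sketch: keep the m features with the smallest (most compact) scores.
# X = np.random.rand(100, 20)
# selected = np.argsort(compactness_style_scores(X))[:10]
```

In this kind of criterion, a feature scores well when its values change little across neighboring samples relative to its overall variance, which matches the abstract's intuition that samples from the same class should remain close to each other on informative features.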