Text clustering is an unsupervised text mining technique used to partition large collections of text documents into groups. It has been reported that text clustering algorithms find it hard to outperform supervised methods and that their clustering performance depends heavily on the chosen text features. Many text feature generation algorithms currently exist, such as the vector space model (VSM) and distributed word embeddings, but each extracts features from only certain aspects of the text. Obtaining features that are as complete as possible from the corpus is therefore key to improving clustering performance. In this paper, we present a hybrid multisource feature fusion (HMFF) framework comprising three components: multi-model feature representation, mutual similarity matrices, and feature fusion. We construct a mutual similarity matrix for each feature source and fuse the discriminative features from these matrices through dimensionality reduction to generate HMFF features; the k-means clustering algorithm can then be applied to partition the input samples into groups. Experiments show that our HMFF framework outperforms other recently published algorithms on 7 of 11 public benchmark datasets and also achieves leading performance on the remaining 4. Finally, we compare the HMFF framework with these competitors on a real-world COVID-19 dataset with an unknown number of clusters, where the clusters generated by HMFF group similar samples much more closely.
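To make the three stages concrete, the following is a minimal sketch of one way such a pipeline could be wired together. It is not the implementation evaluated in this paper, and several choices are assumptions made purely for illustration: word counts and character-trigram counts stand in for the feature sources (a real setup would more likely pair VSM features with word embeddings), cosine similarity is used for the mutual similarity matrices, fusion is done by concatenating the matrices and applying a truncated SVD, and k-means uses a simple farthest-point initialization. All function names (`count_matrix`, `cosine_similarity_matrix`, `fuse`, `kmeans`, `hmff_cluster`) and the toy corpus are hypothetical.

```python
from collections import Counter

import numpy as np


def count_matrix(docs, analyzer):
    """Build a count-based document-term matrix for one feature source."""
    tokens = [analyzer(d) for d in docs]
    vocab = sorted({t for toks in tokens for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, toks in enumerate(tokens):
        for term, count in Counter(toks).items():
            X[i, index[term]] = count
    return X


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between all documents (n x n)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    return Xn @ Xn.T


def fuse(sim_matrices, n_components):
    """Assumed fusion step: concatenate the matrices, then reduce with SVD."""
    stacked = np.hstack(sim_matrices)
    centered = stacked - stacked.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def kmeans(X, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def hmff_cluster(docs, k, n_components=2):
    """Feature sources -> similarity matrices -> fused features -> k-means."""
    analyzers = [
        str.split,                                        # word-level source
        lambda d: [d[i:i + 3] for i in range(len(d) - 2)],  # char trigrams
    ]
    sims = [cosine_similarity_matrix(count_matrix(docs, a)) for a in analyzers]
    return kmeans(fuse(sims, n_components), k).tolist()


# Toy corpus (hypothetical): three pet documents, three finance documents.
DOCS = [
    "cat kitten purr whiskers",
    "kitten cat meow purr",
    "whiskers meow cat kitten",
    "stock market bond prices",
    "market stock shares bond",
    "shares prices stock market",
]
labels = hmff_cluster(DOCS, k=2)
print(labels)
```

Because every similarity matrix has the same shape (one row and one column per document), sources with very different native dimensionalities can be concatenated directly. On this toy corpus the pet documents and the finance documents should fall into separate clusters.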