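Project title: Research on Fast Ensemble Clustering for Heterogeneous Big Data from Social Media
Project number: No.71471009
Project type: General Program
Approval year: 2015
Discipline: Management Science
Project author: Li Hong (李红)
Affiliation: Beihang University (北京航空航天大学)
Funding: 600,000 yuan (RMB)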
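Chinese abstract (translated): Cluster analysis is a key technique in data mining. In recent years, with the rapid growth of social media, complex heterogeneous big data centered on user-generated content has kept emerging, and its massive volume, high dimensionality, and heterogeneity pose severe challenges to traditional cluster analysis methods. This project attempts to address the problem with ensemble clustering. Existing studies show that ensemble clustering can yield better results than traditional clustering, but current research remains unsystematic, and its algorithms are inefficient and lack robustness, so they cannot cope with complex heterogeneous big data. In view of this, the project will focus on ensemble clustering for heterogeneous big data from social media, specifically: 1) combining the efficiency of the binary matrix method with the adaptability of multiple utility functions to establish a theoretical framework for fast ensemble clustering on the basis of generalized K-means theory; 2) studying sample partitioning methods and parallel computing strategies that closely follow the characteristics of heterogeneous social media big data; 3) proposing a weight learning method for the basic partitionings, and solving the algorithm design problems that arise when samples are inconsistent across partitionings or when clustering guidance information is available; 4) developing a system prototype and carrying out applied research on user segmentation in social media. The project will provide a useful reference for the theory and practice of business intelligence in the era of social media big data.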
Chinese keywords (translated): Ensemble Clustering;Cluster Analysis;Social Media;Big Data;Social Computing
English abstract: Cluster analysis is one of the core techniques in data mining. As online social media grow at a dramatic pace, there is an increasing need to cluster heterogeneous big data, especially user-generated content (UGC). The huge volume, high dimensionality, and heterogeneity of such data pose severe challenges to traditional clustering algorithms. Our project aims to address this problem through ensemble clustering. While existing studies have shown the advantages of ensemble clustering over many traditional clustering methods, these studies still lack theoretical foundations, and the low efficiency and/or poor robustness of the proposed methods further prevent them from being used for big data analysis. In light of this, the project focuses on designing fast ensemble clustering methods for heterogeneous big data analysis, with applications in social media. Specifically, 1) we will combine the high efficiency of binary matrix clustering with the flexibility of multiple utility functions to build a theoretical framework for fast ensemble clustering based on the generalized K-means algorithm; 2) we will explore sample partitioning schemes for ensemble clustering and parallelize the computing load according to the characteristics of social media data; 3) we will investigate weight learning for basic partitionings and extend the fast ensemble clustering algorithm to cases with incomplete basic partitionings and/or link constraints in a graph; 4) we will develop a prototype system and apply it to social media user grouping to demonstrate its effectiveness. The outcomes of this project are expected to be of value to both the theory and the practice of business intelligence in the big-data era.
English keywords: Ensemble Clustering;Cluster Analysis;Social Media;Big Data;Social Computing
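
Illustrative sketch (not part of the project record): item 1) of the abstract refers to running K-means on a binary matrix built from the basic partitionings. The Python below is a minimal sketch of that general idea under stated assumptions. The function names `binary_matrix`, `kmeans`, and `consensus` are hypothetical, and plain squared Euclidean distance stands in for whichever utility function the project would actually use. It does not reproduce the project's algorithm, its weighting scheme, or its parallel strategy.

```python
import numpy as np

def binary_matrix(partitionings):
    """One-hot encode each basic partitioning and stack the blocks side by side."""
    blocks = []
    for labels in partitionings:
        labels = np.asarray(labels)
        clusters = np.unique(labels)
        # Row i has a 1 in the column of the cluster that sample i belongs to.
        blocks.append((labels[:, None] == clusters[None, :]).astype(float))
    return np.hstack(blocks)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means with squared Euclidean distance (an assumption)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def consensus(partitionings, k, seed=0):
    """Consensus labels: K-means applied to the stacked binary matrix."""
    return kmeans(binary_matrix(partitionings), k, seed=seed)

if __name__ == "__main__":
    # Three hypothetical basic partitionings of six samples.
    basic = [
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],  # same grouping as the first, labels swapped
        [0, 0, 1, 1, 1, 1],  # disagrees with the others on sample 2
    ]
    print(consensus(basic, k=2))
```

Because each basic partitioning is one-hot encoded, relabeling its clusters only permutes columns of the binary matrix, so the consensus does not depend on how the individual partitionings name their clusters.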