Research on Distributed Attribute-Layered Weighted Subspace Cluster Ensemble Methods for TB-Scale Big Data

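Project name: Research on Distributed Attribute-Layered Weighted Subspace Cluster Ensemble Methods for TB-Scale Big Data

Project number: No.61305059

Project type: Young Scientists Fund project

Approval year: 2014

Discipline: Automation technology, computer technology

Project author: Chen Xiaojun (陈小军)

Affiliation: Shenzhen University

Funding: 250,000 yuan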

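Abstract (translated from Chinese): The challenges of big data clustering lie mainly in two aspects. First, the data are ultra-high-dimensional: such data are inherently sparse and their clusters reside in subspaces, which causes most existing clustering algorithms to fail. Second, the huge number of objects produces a huge data volume, and serial clustering algorithms struggle to cluster data far larger than the memory of a single machine. To address these challenges, this project builds on the applicant's doctoral research and proposes a distributed attribute-layered weighted subspace cluster ensemble technique for TB-scale big data. The research comprises: 1) subspace clustering methods that group and merge attributes and assign layered weights to individual attributes and to attribute groups, to solve the clustering of ultra-high-dimensional data; 2) an attribute-layered weighted subspace cluster ensemble algorithm, combined with cluster ensemble methods, to further improve clustering results on ultra-high-dimensional data; 3) for ultra-high-dimensional big data, a breadth-first distributed attribute-layered weighted subspace cluster ensemble method, built on a MapReduce-based implementation of the k-means soft subspace algorithm, capable of clustering TB-scale ultra-high-dimensional data. The expected results will provide new theoretical tools and key technologies for big data cluster analysis.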
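As a rough illustration of the ideas named in the abstract, the sketch below shows one MapReduce-style k-means iteration that uses a two-level weighted distance (a group weight multiplied by a per-attribute weight). It is a minimal sketch under stated assumptions, not the project's algorithm: the function names, the fixed weights, and the in-memory "partitions" are all hypothetical, and the step that learns the weights is omitted entirely.

```python
# Hypothetical sketch: one map/reduce iteration of k-means with a
# two-level (group x attribute) weighted distance. Weights are fixed
# here for illustration; a real subspace method would update them.

def weighted_sq_dist(x, center, group_w, feat_w, groups):
    """Squared distance where dimension j is scaled by
    group_w[groups[j]] * feat_w[j]."""
    return sum(
        group_w[groups[j]] * feat_w[j] * (x[j] - center[j]) ** 2
        for j in range(len(x))
    )

def map_partition(points, centers, group_w, feat_w, groups):
    """Map step: assign each point in one partition to its nearest
    center and return per-cluster partial (sum vector, count)."""
    partial = {}
    for x in points:
        k = min(
            range(len(centers)),
            key=lambda c: weighted_sq_dist(x, centers[c], group_w, feat_w, groups),
        )
        s, n = partial.get(k, ([0.0] * len(x), 0))
        partial[k] = ([a + b for a, b in zip(s, x)], n + 1)
    return partial

def reduce_partials(partials, centers):
    """Reduce step: merge partial sums from all partitions and
    recompute centers; an empty cluster keeps its old center."""
    totals = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = totals.get(k, ([0.0] * len(s), 0))
            totals[k] = ([a + b for a, b in zip(ts, s)], tn + n)
    return [
        [v / totals[k][1] for v in totals[k][0]] if k in totals else list(centers[k])
        for k in range(len(centers))
    ]

# Toy data: attributes 0 and 1 form group 0; attribute 2 is a noisy group 1.
partitions = [[[0, 0, 5], [0, 1, -5]], [[10, 10, 5], [10, 11, -5]]]
groups = [0, 0, 1]
group_w = [1.0, 0.0]        # the noisy group is weighted out
feat_w = [0.5, 0.5, 1.0]
centers = [[0, 0, 0], [10, 10, 0]]

partials = [map_partition(p, centers, group_w, feat_w, groups) for p in partitions]
new_centers = reduce_partials(partials, centers)
```

Each partition only emits small per-cluster sums, so the reduce step never needs the raw data in one place, which is what makes this pattern suitable for data larger than a single machine's memory.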

Keywords (translated from Chinese): Clustering; subspace clustering; big data

English abstract: High-dimensional big data brings two challenges to current data clustering technologies: very high dimensionality and massive numbers of objects. Such data is very sparse and often contains clusters in subspaces, which makes most clustering methods inapplicable. Big data with massive numbers of objects cannot be clustered by serial clustering algorithms. To address these two challenges, this project studies a distributed variable-layering subspace-weighting cluster ensemble method for TB-scale data. This research continues the applicant's preliminary PhD work, which first proposed a two-level variable-weighting subspace clustering method for multi-view data. The project has three main tasks: 1) develop methods to divide a large number of variables into a few groups, together with a variable-layering subspace-weighting clustering algorithm, to solve the problem of clustering very high-dimensional data; 2) develop a new cluster ensemble algorithm that uses the variable-layering subspace-weighting clustering algorithm to generate component clusterings; 3) develop a scalable distributed variable-layering subspace-weighting cluster ensemble algorithm based on a breadth-first strategy to enable TB-scale data clustering. The expected deliverables will contribute new theories and tools for large-scale data clustering.

English keywords: Clustering; Subspace clustering; Big data
