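Project title: Research on Imbalanced Data Based on Semi-Supervised Ensemble Learning
Project number: No.61203292
Project type: Young Scientists Fund project
Year approved: 2013
Discipline: Automation
Project author: Chen Huanhuan (陈欢欢)
Affiliation: University of Science and Technology of China (中国科学技术大学)
Funding amount: 240,000 CNY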
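Abstract (translated from Chinese): Multi-class data imbalance, where one class has far more samples than one or several other classes, arises widely in practical applications. Traditional learning algorithms tend to overemphasize the majority classes, which leads to very low classifier accuracy on the minority classes. Sampling methods, as an important means of balancing datasets, have received wide attention from researchers. This project targets the shortcomings of existing sampling methods, such as reliance on a single mechanism and the lack of fault tolerance. It proposes a multiple-hypothesis-based sampling method that samples data without assigning class labels, and so approaches multi-class imbalance, a special supervised learning problem, from a semi-supervised learning perspective. It also proposes a method for handling imbalanced data based on collaborative semi-supervised ensemble learning, which deepens the understanding of ensemble learning models and extends their range of application. Finally, it applies the theoretical results directly to practical bioinformatics problems.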
Keywords (translated from Chinese): Imbalanced Learning;Learning in the Model Space;Computational Intelligence;Big Data;
English abstract: Many real-world machine learning applications are characterized as imbalanced classification problems, where some classes have many more instances than others. For such problems, typical classifiers are prone to ignore the small classes, which leads to inferior performance on those classes. Resampling methods, an important approach to class imbalance, have received considerable attention. However, existing resampling methods always assign "assumed" labels to newly sampled data and lack a robust approach for the different types of data found in real-world applications. To address these problems, the proposed project will investigate class imbalance from a semi-supervised learning perspective: it generates unlabelled synthetic data from minority classes and uses both labelled and unlabelled data to build better classifiers through multiple-assumption-based sampling approaches. The project will employ collaborative semi-supervised ensemble methods to address imbalanced problems, which should lead to a better understanding of ensemble models and extend their application domains. In addition, the research will apply its theoretical results to real-world bioinformatics problems.
English keywords: Imbalanced Learning;Learning in the Model Space;Computational Intelligence;Big Data;