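Project title: Web Link Spam Detection Based on Ensemble Learning
Project number: No.61300190
Project type: Young Scientists Fund project
Year of approval: 2014
Discipline: Automation technology, computer technology
Project author: Liu Xinyue (刘馨月)
Affiliation: Dalian University of Technology (大连理工大学)
Funding: 220,000 yuan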
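Chinese abstract (translated): Web spam causes huge losses for search engines and users, and link spam is especially harmful. Semi-automatic link spam detection algorithms propagate human-labeled spam information but ignore many features of spam pages; automatic algorithms use some features of spam pages and detect spam through machine learning, graph regularization, and similar techniques, but ignore the remaining features and human judgment. In short, existing algorithms use incomplete information, have weak detection ability, and have hit a performance bottleneck. To address these difficulties, this project builds on our extensive prior research and applies ensemble learning theory to the link spam detection problem. First, we propose a scheme for ensembling automatic spam detection algorithms, which makes full use of the various features of spam pages and combines the detection abilities of the various algorithms. Second, we propose a strategy for propagating trust and distrust synchronously, which makes full use of information from both good and bad seeds and combines the detection abilities of trust propagation and distrust propagation. Finally, we propose a scheme for ensembling automatic and semi-automatic algorithms, which fully combines the statistical features of spam pages with human judgment and mines all kinds of information for spam detection. This research will form a fairly complete theoretical framework for ensemble-learning-based link spam detection, overcome the difficulties of partial information and single algorithms, and substantially improve spam detection accuracy.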
Chinese keywords (translated): spam detection; trust propagation; distrust propagation; multi-view; ensemble learning
English abstract: Web page spam causes huge losses to both search engine providers and users, and link spam is the most harmful. Semi-automatic link spam detection algorithms propagate human-identified spam information, but neglect many features of spam pages; automatic algorithms use partial features and detect spam with machine learning or graph regularization techniques, but neglect other features and human judgement. In brief, existing algorithms cannot make use of all available information, show weak detection ability, and have hit a performance bottleneck. Building on our extensive previous research, in this project we use ensemble learning theory to solve the link spam detection problem. Firstly, we propose ensemble schemes for automatic spam detection algorithms, which make full use of spam page features and integrate the detection abilities of all kinds of automatic algorithms; secondly, we propose synchronous propagation schemes for trust and distrust, which make full use of the information provided by both good and bad seeds, and integrate the abilities of both trust propagation and distrust propagation; finally, we propose combination schemes for automatic and semi-automatic algorithms, which integrate statistical features of spam pages with human judgements, and thus fuse all kinds of information for spam detection. With efforts made dur
English keywords: spam detection; trust propagation; distrust propagation; multiple views; ensemble learning