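Project title: Research on Event Storyline Mining Techniques for Microblog Data Streams
Project number: No. 61303156
Project type: Young Scientists Fund project
Approval year: 2014
Discipline: Automation technology; computer technology
Project lead: Du Pan (杜攀)
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Funding: CNY 260,000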
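Chinese abstract (translated): Microblog content is updated rapidly and is short, fragmented, voluminous, noisy, and highly redundant. These characteristics cause three problems: (1) information overload, (2) information fragmentation, and (3) information redundancy. An event storyline (Storyline) uses only a limited number of microblog posts to describe all the key episodes (Episode) in the development of a news event, and is therefore an effective way to address these problems. Storyline mining over microblog data streams faces the following main challenges: (1) computing similarity over high-dimensional, sparse microblog data; (2) balancing three objectives for key episodes, namely event relevance, content importance, and information diversity; (3) the performance demands that massive, streaming microblog updates place on storyline identification algorithms. To address these challenges, this project starts from analyzing and exploiting the rich relationships among microblog data, formulates storyline identification as a multi-objective ranking problem on a relation manifold, and studies incremental multi-objective ranking algorithms that cope with streaming updates of microblog data. Grounded in event storyline mining over microblog data streams, the research has both significant research value and broad application prospects, and will provide key technical support for applications such as online public opinion analysis, automatic summarization, and industry research.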
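Chinese keywords (translated): event storyline mining; key episode recognition; storyline update; microblog stream data; subtopic tag generation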
English abstract: Information on microblogs is usually short, fragmentary, abundant, noisy, and heavily repeated, which causes three serious problems: 1) information overload, 2) information fragmentation, and 3) information redundancy. An event storyline selects a few microblog posts to represent all the episodes in the development of an event, and is therefore an effective way to address these problems. Mining event storylines from a microblog stream poses three main challenges: 1) the sparsity and high dimensionality of the data degrade similarity measurement; 2) balancing relevance, prestige, and diversity in episode mining is non-trivial; 3) the constantly updated microblog data requires a more efficient mining algorithm that can update the storyline accordingly. To address these challenges, we propose to make full use of the rich relationships among microblog posts to learn a similarity metric, and then to adopt an incremental multi-objective ranking algorithm to identify the posts that represent the important episodes in the development of an event. Our study focuses on mining event storylines from microblog streams. It supports many important applications, such as online public opinion analysis, automatic summarization, and industry surveys.
English keywords: Storyline Mining; Key Episode Recognition; Storyline Update; Microblog Stream; Subtopic Tag Generation