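Project title: Research on Hot Topic Detection and Content Summarization Based on the Fusion of Heterogeneous Web Text Data
Project number: No. 61273278
Project type: General Program
Year of approval: 2013
Discipline: Automation technology; computer technology
Project author: Li Sujian (李素建)
Affiliation: Peking University
Funding: 800,000 yuan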
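Abstract (translated from Chinese): Hot topic detection and content summarization are research problems of practical value. However, hot topic detection and topic content summarization are usually treated as two independent lines of research, typically rely on a single type of corpus, and have struggled to achieve algorithmic breakthroughs. This proposal therefore fuses heterogeneous web data from news websites, wikis, and microblogs, and combines the two tasks of hot topic detection and topic content summarization. The research covers: (1) Heterogeneous text data fusion based on multi-dimensional annotation. We study how to annotate text along dimensions such as time, discourse, content, and user, and fuse the data under the guidance of the feature integration theory of attention; (2) Hot topic detection and tracking based on the fused data. We study how to use the features of the fused data in each dimension for topic analysis and structural representation, compute the hotness of a topic by combining user attention and media attention, and use the topic structure to collect documents that describe the hot topic; (3) Hot topic content summarization based on wiki data. Because wiki platforms cover domains comprehensively and provide overviews of hot topics that have already occurred, we study how to use collections of wiki documents on topics of the same type to mine the commonalities in how topic content is expressed, and to obtain topic templates that improve summarization performance, thereby moving beyond summarization methods based on sentence extraction.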
Keywords (translated from Chinese): Social media; Topic detection; Automatic summarization; Discourse analysis; Deep learning
English abstract: Detecting and summarizing hot topics is both a fundamental and a practical problem. However, state-of-the-art research usually treats hot topic detection and automatic summarization as two independent tasks, both of which have seen little methodological progress in recent years and rely mainly on a single type of corpus. We therefore propose to fuse heterogeneous text streams from news websites, Wikimedia, and microblog platforms, and to merge the tasks of hot topic detection and automatic summarization. The proposal covers the following three aspects: (1) Heterogeneous text data fusion based on multi-dimensional labeling and indexing. Dimensions such as time, discourse, content, and user will be designed for text labeling, and the feature integration theory of attention will be introduced to guide the data fusion. (2) Topic detection and tracking based on the fused data. We will investigate how to represent and analyze a topic according to the multi-dimensional labeling results. The hotness of a topic will then be measured by considering both public concern and media concern. Furthermore, the related documents that describe the hot topics will be collected using the corresponding topic structure. (3) To res
English keywords: Social media; Topic detection; Automatic summarization; Discourse parsing; Deep learning