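Project title: Research on a Word-Sense-Based Document Representation Model and Multilingual Sub-Document Topic Analysis
Project number: No.61272233
Project type: General Program
Year of approval: 2013
Discipline: Automation technology, computer technology
Project author: 夏云庆 (Xia Yunqing)
Author affiliation: Tsinghua University
Project funding: 820,000 yuan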
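Chinese abstract (in translation): The accelerating growth of the Internet brings both opportunities and challenges to public opinion monitoring, which has topic analysis at its core. Topic analysis research currently faces three main difficulties: (1) Document representation models based on words or word clusters cannot effectively handle semantic phenomena such as polysemy (one word, several senses) and synonymy (several words, one sense), and therefore cannot represent documents precisely. (2) Topic analysis methods that operate on whole documents cannot cope with documents that contain multiple topics. (3) The multilingual/cross-lingual bottleneck is becoming increasingly prominent. To address these problems, this project proposes a word-sense-based document representation model (SCM) with the following advantages. First, polysemy and synonymy are captured accurately at the document representation stage, so documents are represented more precisely. Second, a word-sense-based model is robust to document length, and the data sparsity problem is greatly reduced. Third, representing documents by word senses offers strong potential for multilingual/cross-lingual processing. The project further proposes two innovative pieces of work on topic analysis: first, fine-grained topic analysis at the sub-document level, which makes topic analysis more accurate; second, multilingual/cross-lingual topic analysis, which improves the international applicability of topic analysis. Successful completion of this project will further advance topic analysis research and raise the level of public opinion monitoring in China, thereby promoting the healthy development of the Internet.
Chinese keywords: word sense;document representation model;topic analysis;cross-language;sub-document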
English abstract: The accelerating growth of the Internet brings both opportunities and challenges to the public sentiment monitoring industry, which is built on topic analysis. Three challenges are worth noting. First, words and word clusters cannot represent semantic phenomena such as polysemy and synonymy, which leads to less accurate document representation. Second, a document usually involves more than one topic, which makes whole-document topic analysis less effective. Third, the multi-/cross-lingual issue has become a prominent bottleneck. In this project, word sense is explored and the sense cluster model (SCM) is proposed. By using word sense, SCM exhibits three advantages: 1) Polysemy and synonymy are naturally addressed. 2) The model is robust in handling shorter documents. 3) The model shows strong potential for processing multi-/cross-lingual documents. Based on SCM, two tasks are planned in this project: 1) A fine-grained topic analysis method will be designed to handle sub-documents rather than whole documents, which results in more accurate topic analysis. 2) A multi-/cross-lingual topic analysis method will be developed to lower the language barrier to information access. Success of this project will advance research on topic analysis and the development of public sentiment monitoring systems, and thereby contribute to a healthy Internet.
English keywords: word sense;document representation model;topic analysis;cross-language;sub-document