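Project title: Research on Key Technologies for Automatic Knowledge Base Updating Based on Short Texts
Project number: No.61472040
Project type: General Program (面上项目)
Year of approval: 2015
Discipline: Other
Project author: Song Dandan (宋丹丹)
Author affiliation: Beijing Institute of Technology (北京理工大学)
Funding amount: 840,000 CNY (84万元)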
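Abstract (translated from Chinese): Knowledge bases are important for organizing and using knowledge, but traditional knowledge base updating relies on manual editing, so content lags badly behind, and automatic updating has become a research hotspot. Short-text data, which has grown rapidly in recent years, is massive, real-time, and carries information found nowhere else, making it an important data source for knowledge base updating. However, short texts are brief, noisy, varied in expression, and grammatically irregular, which poses major challenges for automatic updating. This project studies key technologies for automatic knowledge base updating based on short texts. Specifically: to meet the indexing needs of massive real-time short texts, it builds an improved dependency grammar model that incorporates density and proposes an entity-oriented method for identifying the indexable content of short texts; it studies methods for extending usable features in a sparse feature space, using temporal and spatial information for effective feature extension; based on limited annotated data, it proposes training and analysis methods that combine classification and ranking objectives to analyze the relevance between entities and short texts; and it proposes a self-learning algorithm for semantic rule templates to study adaptive extraction of entity information from short texts. Together these achieve the goal of automatically updating knowledge bases from massive, real-time, and diverse short texts.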
Keywords (translated from Chinese): Knowledge base; short text; knowledge mining; text mining
English abstract: Knowledge bases are essential for knowledge management and utilization. But traditional knowledge bases are maintained manually by volunteer editors, which makes them hard to keep up to date. Consequently, automatic updating of knowledge bases has become a hot research topic. In recent years, the volume of short texts has grown rapidly, and because their information is massive, real-time, and specific, short texts have become an important information source for updating knowledge bases. However, short texts contain little content and much noise, with varied expressions and irregular grammar, and thus pose major challenges for the automatic updating process. In this project, we will study key technologies for automatic updating of knowledge bases based on short texts. Firstly, to address the challenges in data storage and indexing, we will construct a tailored dependency grammar model that incorporates density, and study the identification of indexable content in short texts. Secondly, we will propose a feature extension method for entities and short texts that incorporates temporal and spatial information, to address the sparsity of the feature space. Thirdly, with limited annotated data, we will propose a relevance discrimination method for entity and short-text pairs that combines classification and ranking objectives. Finally, we will provide a self-learning method for semantic rule templates to adaptively extract entity information from short texts. In this way, we can achieve the goal of automatically updating knowledge bases from short texts.
English keywords: Knowledge Base; Short Texts; Knowledge Mining; Text Mining