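Project title: Research on Automatic Indexing of Book Hierarchical Topics
Project number: No.71303089
Project type: Young Scientists Fund
Approval year: 2014
Discipline: Management Sciences
Project author: Chen Jing (陈静)
Affiliation: Central China Normal University (华中师范大学)
Funding: 200,000 CNY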
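Abstract (translated from Chinese): With the rapid growth of e-book information resources, the conflict between the coarse granularity of current automatic book topic indexing and users' increasingly fine-grained information needs is becoming more serious; automatic indexing of hierarchical book topics is an effective way to resolve it. Building on a review of theory and an analysis of needs, this project focuses on constructing a model and a set of methods for automatic indexing of hierarchical book topics. First, it designs a table-of-contents recognition algorithm that combines machine learning and semantic analysis to extract table-of-contents features and marking rules from books. Next, it develops a method for partitioning the hierarchical topic structure of a book: table-of-contents recognition and fuzzy retrieval are used to partition the book's coarse topic structure, and a hierarchical topic model and cluster analysis are then used to partition the hierarchical topic structure of, and assign topic index terms to, the smallest logical units produced by the coarse partition. Then, a topic information extraction method based on a probabilistic topic model extracts the topic information of each logical unit in the coarse topic structure. Together these steps achieve automatic indexing of hierarchical book topics, with the aim of refining the granularity of book information research, broadening the scope of research on book information organization, and advancing the management and application of book information resources.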
Keywords (translated from Chinese): hierarchical topic; automatic indexing; book; topic structure partition; topic extraction
English abstract: With the rapid growth of electronic book information resources, the contradiction between the coarse granularity of current book topic indexing and the increasingly fine-grained needs of information users is becoming more serious. Combining book topic structure partition with book hierarchical topic extraction to index book hierarchical topics (BHT) is an effective way to resolve this contradiction. On the basis of a theoretical review and a needs analysis, this project aims to build an automatic indexing model for BHT and its methodologies, drawing on theories and methods from artificial intelligence and data mining. First, an algorithm combining machine learning and semantic analysis is designed for table of contents (TOC) recognition, mining the characteristics and marking rules of the TOC. Then, the structure of BHT is partitioned in two steps. The first step partitions the coarse structure of the book using a fuzzy retrieval model and the results of TOC recognition. The second step applies a hierarchical topic model and clustering analysis to the lowest-level text fragments from the first step, partitioning their hierarchical topic structure and indexing them. Finally, topic extraction and indexing for the book's coarse structure are performed with an algorithm based on a probabilistic topic model. So, automatic indexing of
English keywords: hierarchical topic; automatic indexing; book; topic structure partition; topic extraction