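Project title: Research on Automatic Extraction of Fruit Variety Information Based on Weakly Supervised Learning
Project number: No.61503386
Project type: Young Scientists Fund project
Year of initiation/approval: 2016
Discipline: Automation technology, computer technology
Project author: Chen Ying (陈瑛)
Author affiliation: China Agricultural University (中国农业大学)
Funding amount: CNY 200,000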
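Chinese abstract (translated): Information services for China's fruit germplasm resources require a complete knowledge base of fruit variety names, but building and maintaining such a knowledge base by hand is time-consuming and labor-intensive. This project proposes to use information extraction to automatically expand an existing knowledge base of fruit variety names. Because online encyclopedias and e-commerce advertisements provide a large amount of fruit variety information, the project studies these two kinds of text in depth and proposes a weakly supervised method for extracting fruit variety information. First, to address the inadequate performance of Chinese word segmentation on agricultural text, it proposes a term-pair extraction method based on linguistic structure, which extracts term pairs according to topic structure theory and thereby sidesteps the word segmentation problem. Second, to address the low coverage of existing knowledge bases, it proposes a graph-based relation inference method, which uses graph-based inference to exploit the redundant information in differently structured texts for relation identification. Finally, to address the highly noisy features in long-distance relation identification, it proposes a feature extraction method based on convolutional neural networks, which uses topic structure to extract features at the discourse level. This research will provide an important foundation for the development of agricultural informatization and will also advance theoretical research on information extraction based on weakly supervised learning.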
Chinese keywords (translated): information extraction; weakly supervised learning; topic structure; deep learning; graph models
English abstract: Information services for fruit varieties need a complete set of fruit variety names. However, constructing and maintaining a knowledge base of fruit variety names is costly. We therefore use information extraction technology to populate the current knowledge base of fruit variety names. Two types of text are used in our project: online advertisements and an online encyclopedia. These two types of text not only provide large-scale information about fruit varieties, but also reflect how fruit varieties are actually expressed. Based on them, we propose a weakly-supervised information extraction method. Firstly, because Chinese word segmentation does not work well on agricultural texts, we propose a linguistic-structure-based word extraction method that uses topical structures to extract pairs of words. Secondly, because the coverage of the current knowledge base of fruit variety names is low, we propose a relation induction method based on a graph-based model, which uses the redundant information in structured texts and free texts. Thirdly, because the features used in long-distance relation detection are noisy, we propose a document-level feature extraction method that uses topical structures to extract position features and then uses a convolutional deep neural network to aggregate the features. Overall, our project can not only provide an important basis for the development of agricultural informatization, but also promote theoretical research on information extraction technologies.
English keywords: information extraction; weakly-supervised learning; topical structures; deep learning; graph-based models