Research and Application of Biological Entity Semantic Networks Based on Natural Language Processing Technology

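Project title: Research and Application of Biological Entity Semantic Networks Based on Natural Language Processing Technology

Project number: No.61202304

Project type: Young Scientists Fund

Year of approval: 2013

Discipline: Computer Science

Project author: Wei Xiaomei (魏小梅)

Author affiliation: Huazhong Agricultural University (华中农业大学)

Funding: 250,000 yuan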

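Chinese abstract: This project studies and extracts semantic relations among named entities (proteins, genes, and diseases) in the biological literature, builds a disease-oriented semantic relation network, and presents the network in visual form. The work comprises: annotating entity-relation resources; studying and building entity relation extraction models; extracting relations and constructing the entity semantic relation network; studying methods for discovering new entity relations; studying how the relation network evolves across corpus time periods in order to identify biological research hotspots; and applying the entity relation network to the diagnosis and study of breast tumors. Given the characteristics of the research field and its objects, we pursue these goals by combining natural language processing methods with multiple kernel learning. Specifically, taking open biological literature as the object of study, we first process the text with natural language processing techniques to obtain syntactic parse trees and dependency graphs, then derive suitable syntactic and semantic features from the parsing results, and use a composite kernel function based on the Smith-Waterman algorithm to extract semantic relations between entities and construct the semantic network. We will apply the extracted semantic network to early-stage auxiliary diagnosis of breast tumors, and we will evaluate the results on public data.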

Chinese keywords: biological text; named entity; corpus; relation extraction; knowledge base

English abstract: This project focuses on extracting semantic relations between biomedical entities such as proteins, genes, and diseases. On the basis of these relations, a semantic network is created and visualized. The research proceeds as follows: annotating entity-relation resources; building an entity relation extraction model; extracting relations and building the semantic relation network; exploring novel approaches to discovering entity relations; tracking how the relation network evolves with the corpus over time and discovering new dimensions in biomedicine; and applying the entity relation network to the diagnosis of breast cancer. In accordance with the characteristics of the domain and the object of research, we integrate natural language processing techniques with multi-kernel learning. We adopt open biomedical texts as the research corpus. First, we process the texts with natural language processing techniques to obtain syntactic parse trees and dependency graphs. Then we extract features from the parsing results. We adopt the Smith-Waterman algorithm to extract semantic relations between entities, and the semantic network is built on these relations. Finally, we will apply the semanti

English keywords: biomedical text; named entity; corpus; relation extraction; knowledge base
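
Illustrative note: the abstract names the Smith-Waterman algorithm as the basis of its kernel but gives no details. The sketch below shows only the general idea of Smith-Waterman local alignment applied to two token sequences, such as the labels along a dependency path between two entities. The function names, the token labels, and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions made for illustration and are not taken from the project. The normalized score is a plain similarity measure; it is not the project's composite kernel and is not guaranteed to be a valid positive semidefinite kernel.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token sequences a and b.

    Scoring values are illustrative assumptions, not the project's settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + step   # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap          # gap in b
            left = score[i][j - 1] + gap        # gap in a
            # Local alignment: scores never drop below zero
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


def normalized_similarity(a, b):
    """Scale the alignment score by the self-alignment scores of a and b."""
    denom = (smith_waterman(a, a) * smith_waterman(b, b)) ** 0.5
    return smith_waterman(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical dependency-path token sequences for two sentences
    path_1 = ["nsubj", "binds", "prep_to"]
    path_2 = ["nsubj", "binds", "dobj"]
    print(normalized_similarity(path_1, path_2))
```

A pairwise similarity of this kind could be computed for every pair of candidate sentences and passed to a learner that accepts precomputed similarities, though whether the project did so is not stated in the abstract.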
