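Project title: Research on Mathematical Formula Search Techniques for PDF Documents
Project number: No.61472014
Project type: General Program
Year of approval: 2015
Discipline: Computer Science
Project author: Tang Zhi (汤帜)
Author affiliation: Peking University
Project funding: CNY 800,000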
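Chinese abstract: Mathematical formulae condense the essence of human knowledge. Formula resources on the Web are growing steadily and formulae have complex structures, so formula search has become an active research topic in areas such as search engines and knowledge management. However, existing formula search techniques mainly target Web resources (e.g., MathML, LaTeX) and still cannot readily handle PDF, the mainstream document format that contains large numbers of formulae. This project therefore proposes to study formula search techniques for PDF documents. Centered on four key problems (formula identification and extraction, query input, tokenization and indexing, and matching and ranking), it will study algorithms for formula localization and structure analysis, tokenization and indexing based on hierarchical generalization, similarity evaluation based on multilevel matching, context matching, and learning-to-rank and re-ranking. The project will implement a complete formula search workflow, including automatic discovery and structuring of formulae in PDF documents and a novel, convenient formula input method (copying directly from PDF documents), addressing the difficulty that existing formula search systems have with formulae in PDF documents. The proposed semi-semantic structure tree construction and similarity computation methods will effectively improve the recall and precision of formula search, advance formula search technology, and promote the mining and use of formula resources.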
Chinese keywords: Information Retrieval; Search Engine; Information Retrieval Model; Semantic Retrieval; Web Information Retrieval
English abstract: Mathematical formulae are the essence of human knowledge. Because digital resources containing mathematical formulae are growing explosively on the Web and formulae have complex structures, mathematics retrieval has become a new and important research topic in many areas, e.g., search engines and knowledge management. However, existing mathematics retrieval techniques mainly target structured mathematics resources on the Web and fail to process the unstructured mathematical content in PDF, the most widely used document format containing abundant mathematical formulae. This proposal aims to study mathematics retrieval techniques for PDF documents, addressing four key problems: identification and extraction of mathematical content; query input interface; tokenization and indexing; and matching and ranking. Concretely, this proposal aims to develop a formula identification and semantic interpretation algorithm for PDF documents; an indexing technique that considers hierarchical generalization of structures; and a ranking approach based on multilevel matching, context, learning-to-rank, and re-ranking strategies. In addition, a complete mathematics retrieval workflow will be implemented, with a crawler that automatically discovers and structures mathematical formulae in PDF documents and a novel, convenient formula input interface that lets users copy and paste from PDF documents, solving the most challenging problem of obtaining structured formula queries from PDF documents. Furthermore, the proposed semi-semantic tree construction and similarity calculation algorithms will effectively improve the recall and precision of mathematics retrieval, promote the development of mathematics retrieval techniques, and facilitate the mining and use of mathematics resources.
English keywords: Information Retrieval; Search Engine; Information Retrieval Model; Semantic Retrieval; Web Information Retrieval