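Project title: Research on High-Order Context Modeling Methods for Chinese Handwritten Document Recognition
Project number: No. 61305005
Project type: Young Scientists Fund
Year of approval: 2014
Discipline: Automation Technology, Computer Technology
Project author: Wang Qiufeng (王秋锋)
Author affiliation: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Project funding: 250,000 yuan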
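Chinese abstract: Recognition of freely written documents (character string recognition) is an important branch of the character recognition field, and many problems still need to be solved to meet the needs of offline document digitization. The performance of Chinese handwritten document recognition is currently still poor, and one major reason is the insufficient use of context information, especially the modeling of high-order context. This project intends to study high-order context modeling in document recognition in depth, with the aim of improving recognition performance and promoting practical applications. The main research topics are: (1) high-order linguistic context modeling methods, including the study and application of statistical and structured language models; (2) high-order geometric context modeling methods, including geometric feature extraction and the representation and learning of statistical models; (3) efficient search algorithms for handwritten document recognition that integrate high-order context models. Building on our existing technical and data foundations, we will combine statistical models with syntactic structure analysis techniques, adopt fast and accurate search algorithms, and evaluate recognition performance on large-scale freely written Chinese documents. The high-order context modeling and search methods proposed in this project depend little on the writing language, and can therefore be extended to document recognition in other languages, such as English and Arabic.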
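Chinese keywords: handwritten character string recognition; language model; neural network language model; language model adaptation; geometric model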
English abstract: Unconstrained handwritten document recognition (string recognition) is an important branch of character recognition, and it still faces great challenges because its performance remains too limited for practical document digitization. Unlike isolated character recognition, string recognition depends heavily on context information; however, most systems do not model context sufficiently, especially high-order context. This project studies context modeling techniques for document recognition in order to improve recognition performance, and addresses three main issues: (1) modeling high-order linguistic context, including statistical and structured language models; (2) modeling high-order geometric context, including geometric feature extraction and statistical modeling; (3) developing an efficient search algorithm for the huge and ever-expanding search space that results from incorporating high-order context information. Building on the developed string recognition techniques and handwriting databases, we will test the proposed methods on large-scale Chinese handwritten documents by combining statistical and structured models with an efficient search algorithm. Moreover, the context modeling and search methods can also be applied to the recognition of documents in other languages (such as English and Arabic).
English keywords: Handwritten text recognition; language model; neural network language model; language model adaptation; geometric model