Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Text line segmentation is one of the preprocessing stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed for a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or curved text lines. For that reason, the segmenter could be of particular interest to cultural institutions, such as libraries, archives, and museums, that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results, combined with low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg as part of the initiative to reprocess its historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in accuracy and speed, comparing it to the segmentation algorithm bundled with the open source OCR software in use.
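To make the two ingredients named in the abstract concrete, here is a minimal Python sketch that chains a morphological operation (horizontal dilation) with a horizontal histogram projection. It is a generic illustration, not the algorithm from the paper: the function names `dilate_horizontal` and `segment_lines`, the parameters `smear` and `min_ink`, and the boolean-array input format are all assumptions made for this example.

```python
import numpy as np


def dilate_horizontal(binary, width):
    """Dilate ink pixels with a 1 x (2*width + 1) horizontal structuring element.

    `binary` is a 2-D boolean array where True marks ink (assumed input format).
    """
    out = binary.copy()
    for shift in range(1, width + 1):
        out[:, shift:] |= binary[:, :-shift]  # smear ink to the right
        out[:, :-shift] |= binary[:, shift:]  # smear ink to the left
    return out


def segment_lines(binary, smear=2, min_ink=1):
    """Return (top, bottom) row ranges of text line candidates, bottom exclusive.

    `smear` and `min_ink` are illustrative parameters, not values from the paper.
    """
    smeared = dilate_horizontal(binary, smear)
    profile = smeared.sum(axis=1)  # horizontal projection: ink pixels per row
    is_text = profile >= min_ink
    # Pad with False so every run of text rows has a start and an end transition.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(top), int(bottom)) for top, bottom in zip(changes[0::2], changes[1::2])]


# Synthetic page with two "text lines" separated by blank rows.
page = np.zeros((10, 20), dtype=bool)
page[2:4, 3:15] = True
page[6:9, 5:10] = True
print(segment_lines(page))  # [(2, 4), (6, 9)]
```

Each returned pair can be turned into a line bounding box by also taking the leftmost and rightmost ink columns within those rows. A plain projection like this assumes roughly horizontal lines; the curved or degraded text mentioned in the abstract is exactly where such a simple version breaks down and a more robust combination is needed.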



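OCR (Optical Character Recognition) is the process in which an electronic device, such as a scanner or digital camera, examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text. In other words, for printed characters, it is the technology of optically converting the text of a paper document into a black-and-white bitmap image file, and then using recognition software to convert the text in that image into a text format that word processing software can edit further.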
