In today's technological era, document images play an important and integral part in our day-to-day life. With the surge of Covid-19 in particular, digitally scanned documents have become a key means of communication, since they avoid the risk of infection through physical contact. Storing and transmitting scanned document images is memory intensive, so compression techniques are used to reduce image size before archival and transmission. There are two ways to extract information from, or operate on, compressed images. The first is to decompress the image, operate on it, and then compress it again for efficient storage and transmission. The second is to exploit the characteristics of the underlying compression algorithm to process the images directly in their compressed form, without decompression and re-compression. In this paper, we propose a novel idea of developing an OCR system for CCITT (International Telegraph and Telephone Consultative Committee) compressed machine-printed TIFF document images that operates directly in the compressed domain. After text regions are segmented into lines and words, a hidden Markov model (HMM) is applied for recognition using the three coding modes of CCITT: horizontal, vertical, and pass. Experimental results show that OCR based on the pass mode gives promising results.