Extracting relevant information from a large number of documents is a challenging and tedious task. The results produced by traditional full-text search engines and text-based image retrieval systems are not optimal. Information retrieval (IR) tasks become more challenging with nontraditional language scripts, as in the case of Indic scripts. We have developed an OCR (Optical Character Recognition) Search Engine to build an Information Retrieval and Extraction (IRE) system that replicates current state-of-the-art methods using IRE and Natural Language Processing (NLP) techniques. Here we present a study of the methods used for performing search and retrieval tasks. The details of this system, along with the statistics of the dataset (source: National Digital Library of India, or NDLI), are also presented. Additionally, we discuss ideas to explore further and add value to research in IRE.