Optical Character Recognition (OCR), the process of extracting textual information from scanned documents, is a vital technology for digitizing and indexing physical documents. When a document is visually damaged or contains non-textual elements, existing technologies can yield poor results, because erroneous detections greatly degrade the quality of OCR. In this paper, we present BusiNet, a detection network aimed at OCR of business documents. Business documents often include sensitive information and therefore cannot be uploaded to a cloud service for OCR. BusiNet is designed to be fast and lightweight so that it can run locally, avoiding these privacy issues. Furthermore, BusiNet is built to handle corruption and noise in scanned documents using a specialized synthetic dataset. The model is made robust to unseen noise by employing adversarial training strategies. We evaluate the model on publicly available datasets, demonstrating its usefulness and broad applicability.