Accessing daily news content remains a major challenge for people with print impairments, including those who are blind or have low vision, owing to the opacity of printed content and the barriers posed by online sources. In this paper, we present our approach to digitizing print newspapers into an accessible file format such as HTML. We use an ensemble of instance segmentation and detection frameworks for newspaper layout analysis, followed by OCR to recognize text elements such as headlines and article text. Additionally, we propose the EdgeMask loss function for the Mask-RCNN framework to improve segmentation mask boundaries and hence the accuracy of the downstream OCR task. Empirically, we show that our proposed loss function reduces the Word Error Rate (WER) of news article text by 32.5%.
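For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the OCR output and the reference text, divided by the number of reference words. The following is a minimal sketch under that standard definition, not the evaluation code used in this work. The function names `word_error_rate` and `relative_reduction` are hypothetical, whitespace tokenization is an assumption, and the abstract does not state whether the 32.5% figure is a relative or an absolute reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fractional drop in WER relative to the baseline (hypothetical helper)."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only; these are not data from the paper.
    print(word_error_rate("the quick brown fox", "the quick fox"))
    print(relative_reduction(0.20, 0.15))
```

Under this definition, a cleaner mask boundary helps whenever it stops the OCR stage from clipping words at the edge of an article region or picking up words from a neighbouring one, since each such word counts as a deletion or an insertion.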