Visual information extraction (VIE) has attracted considerable attention recently owing to various advanced applications such as document understanding, automatic marking and intelligent education. Most existing works decouple this problem into several independent sub-tasks of text spotting (text detection and recognition) and information extraction, completely ignoring the high correlation among them during optimization. In this paper, we propose a robust visual information extraction system (VIES) towards real-world scenarios, which is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction that takes a single document image as input and outputs the structured information. Specifically, the information extraction branch collects abundant visual and semantic representations from text spotting for multimodal feature fusion and, conversely, provides higher-level semantic clues that contribute to the optimization of text spotting. Moreover, regarding the shortage of public benchmarks, we construct a fully annotated dataset called EPHOIE (https://github.com/HCIILAB/EPHOIE), the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper heads with complex layouts and backgrounds, containing a total of 15,771 Chinese handwritten or printed text instances. Compared with state-of-the-art methods, our VIES shows significantly superior performance on the EPHOIE dataset and achieves a 9.01% F-score gain on the widely used SROIE dataset under the end-to-end scenario.