A large amount of document data exists only in unstructured form, such as raw images with no accompanying text information. Designing a practical document image analysis system is a meaningful but challenging task. In previous work, we proposed PP-Structure, an intelligent document analysis system. To further improve its functionality and performance, we propose PP-StructureV2 in this work, which contains two subsystems: Layout Information Extraction and Key Information Extraction. First, we integrate an Image Direction Correction module and a Layout Restoration module to extend the functionality of the system. Second, PP-StructureV2 adopts 8 practical strategies for better performance. For the Layout Analysis model, we introduce the ultra-lightweight detector PP-PicoDet and the knowledge distillation algorithm FGD to make the model lighter, which increases inference speed by 11 times with comparable mAP. For the Table Recognition model, we use PP-LCNet, CSP-PAN and SLAHead to optimize the backbone, feature fusion and decoding modules, respectively, which improves table structure accuracy by 6\% with comparable inference speed. For the Key Information Extraction model, we introduce VI-LayoutXLM, a visual-feature-independent LayoutXLM architecture, together with the TB-YX sorting algorithm and the U-DML knowledge distillation algorithm, which bring improvements of 2.8\% and 9.1\% in Hmean on the Semantic Entity Recognition and Relation Extraction tasks, respectively. All of the models and code mentioned above are open-sourced in the PaddleOCR GitHub repository.
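The abstract names the TB-YX sorting algorithm without describing it. As a rough illustration only, the sketch below assumes that TB-YX orders text boxes top to bottom by their y coordinate and then left to right by their x coordinate, treating boxes at nearly the same height as one line. The function name `sort_boxes_tb_yx`, the `(x_min, y_min, x_max, y_max)` box layout, and the `y_tolerance` default are assumptions made for this example, not the PaddleOCR implementation.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def sort_boxes_tb_yx(boxes: List[Box], y_tolerance: float = 20.0) -> List[Box]:
    """Order boxes top to bottom, then left to right (assumed reading of TB-YX)."""
    # First pass: sort by top edge, breaking ties by left edge.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))

    # Second pass: neighbours whose top edges differ by less than y_tolerance
    # are treated as lying on the same line and moved into left-to-right order.
    for i in range(1, len(ordered)):
        j = i
        while j > 0:
            prev, cur = ordered[j - 1], ordered[j]
            same_line = abs(cur[1] - prev[1]) < y_tolerance
            if same_line and cur[0] < prev[0]:
                ordered[j - 1], ordered[j] = cur, prev
                j -= 1
            else:
                break
    return ordered


if __name__ == "__main__":
    sample = [(200, 10, 300, 30), (10, 14, 100, 34), (10, 60, 100, 80)]
    for box in sort_boxes_tb_yx(sample):
        print(box)
```

In the sample, the box starting at x = 10, y = 14 sits slightly lower than the one at x = 200, y = 10, so a plain sort on (y, x) would read the right-hand box first. The tolerance pass places the left-hand box first, which matters when the reading order of text segments is fed to a downstream entity or relation model.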