Document layout analysis is crucial for understanding document structure. In this task, the vision and semantics of a document, together with the relations between its layout components, all contribute to the understanding process. Although many methods have been proposed to exploit this information, their results remain unsatisfactory. NLP-based methods model layout analysis as a sequence labeling task and have insufficient capacity for layout modeling. CV-based methods model layout analysis as a detection or segmentation task, but suffer from inefficient modality fusion and a lack of relation modeling between layout components. To address these limitations, we propose VSR, a unified framework for document layout analysis that combines vision, semantics, and relations. VSR supports both NLP-based and CV-based methods. Specifically, we first introduce vision through the document image and semantics through text embedding maps. Then, a two-stream network extracts modality-specific visual and semantic features, which are adaptively fused to make full use of complementary information. Finally, given component candidates, a relation module based on a graph neural network is incorporated to model the relations between components and output the final results. On three popular benchmarks, VSR outperforms previous models by large margins. Code will be released soon.
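The adaptive fusion step can be illustrated with a minimal sketch. It assumes a simple gating scheme in which a scalar gate, computed from the concatenated visual and semantic features, weights the two streams. The abstract does not specify the fusion mechanism, so the function name `adaptive_fuse`, the parameters `weights` and `bias`, and the gating formula itself are all hypothetical and are not taken from VSR.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(visual, semantic, weights, bias=0.0):
    """Fuse two same-length feature vectors with a scalar gate.

    Hypothetical gating scheme (an assumption, not the published method):
        gate  = sigmoid(w . [visual; semantic] + b)
        fused = gate * visual + (1 - gate) * semantic
    """
    if len(visual) != len(semantic):
        raise ValueError("visual and semantic features must have the same length")
    concat = list(visual) + list(semantic)
    if len(weights) != len(concat):
        raise ValueError("weights must match the concatenated feature length")

    # The gate decides how much each modality contributes to the result.
    gate = sigmoid(sum(w * x for w, x in zip(weights, concat)) + bias)
    fused = [gate * v + (1.0 - gate) * s for v, s in zip(visual, semantic)]
    return fused, gate


# With all-zero weights and bias the gate is 0.5, so the result is a plain average.
fused, gate = adaptive_fuse([1.0, 3.0], [3.0, 1.0], weights=[0.0] * 4)
```

In a trained model the gate parameters would be learned, and the gate would typically be computed per spatial position or per channel rather than as a single scalar. The scalar form is used here only to keep the sketch short.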