Page-level analysis of documents has been a topic of interest in digitization efforts, and multimodal approaches have been applied to both page classification and page stream segmentation. In this work, we focus on capturing finer semantic relations between the pages of a multi-page document. To this end, we formalize the task as semantic parsing of interpage relations and propose an end-to-end approach for interpage dependency extraction, inspired by the dependency parsing literature. We further design a multi-task training approach that jointly optimizes page embeddings for segmentation, classification, and parsing of page dependencies, using textual and visual features extracted from the pages. We also combine the features from the two modalities to obtain multimodal page embeddings. To the best of our knowledge, this is the first study to extract rich semantic interpage relations from multi-page documents. Our experimental results show that, over a naive baseline, the proposed method improves the labeled attachment score (LAS) by 41 percentage points for semantic parsing, and improves accuracy by 33 percentage points for page stream segmentation and by 45 percentage points for page classification.
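For readers unfamiliar with the parsing metric, the following is a minimal sketch of how the labeled attachment score (LAS) could be computed when pages play the role that tokens play in dependency parsing. It follows the standard definition from the dependency parsing literature, not necessarily the exact evaluation protocol used in this work. The function name `attachment_scores`, the `(head, label)` encoding, and the relation labels in the example are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One arc per page: (index of the head page, relation label).
# Index 0 is assumed here to denote an artificial root; this is a
# convention borrowed from dependency parsing, not taken from the paper.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one document.

    UAS: fraction of pages whose predicted head matches the gold head.
    LAS: fraction of pages whose predicted head AND label both match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain one arc per page")
    if not gold:
        raise ValueError("document must contain at least one page")

    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / n, las_hits / n


# Hypothetical four-page document with made-up relation labels.
gold = [(0, "root"), (1, "continuation"), (1, "attachment"), (0, "root")]
pred = [(0, "root"), (1, "continuation"), (2, "attachment"), (0, "continuation")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 0.75, LAS = 0.50
```

In this example the third page is attached to the wrong head, which lowers both scores, while the fourth page has the correct head but the wrong label, which lowers only LAS. A gain of 41 percentage points in LAS therefore means that many more pages receive both the correct head page and the correct relation label.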