Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches predict the two structures simultaneously with two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, these methods produce imprecise bounding boxes because the logical representation lacks local visual information. To address this issue, we propose VAST, an end-to-end sequential modeling framework for table structure recognition. It contains a novel coordinate sequence decoder that is triggered by the representations of non-empty cells from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, where the left, top, right and bottom coordinates are decoded sequentially to leverage the dependencies between coordinates. Furthermore, we propose an auxiliary visual-alignment loss that encourages the logical representations of non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. Extensive experiments demonstrate that our proposed method achieves state-of-the-art results in both logical and physical structure recognition. The ablation study also validates that the proposed coordinate sequence decoder and the visual-alignment loss are key to the success of our method.
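As an illustration of what modeling bounding box coordinates as a language sequence can look like, the following is a minimal sketch and not the VAST implementation. It assumes each coordinate is quantized into one of `NUM_BINS` discrete tokens and that a scoring function (the hypothetical `score_fn` parameter, standing in for a trained decoder) returns one score per bin given the cell representation and the coordinates decoded so far. The bin count, the function names and the greedy decoding strategy are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

NUM_BINS = 1000  # assumption: vocabulary size for coordinate tokens

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def quantize_box(box: Box, width: float, height: float,
                 num_bins: int = NUM_BINS) -> List[int]:
    """Map pixel coordinates to discrete tokens in [0, num_bins - 1]."""
    def to_bin(value: float, extent: float) -> int:
        value = min(max(value, 0.0), extent)  # clamp to the image
        return min(int(value * num_bins / extent), num_bins - 1)

    left, top, right, bottom = box
    return [to_bin(left, width), to_bin(top, height),
            to_bin(right, width), to_bin(bottom, height)]


def dequantize_box(tokens: Sequence[int], width: float, height: float,
                   num_bins: int = NUM_BINS) -> Box:
    """Map tokens back to pixel coordinates, using the centre of each bin."""
    left, top, right, bottom = tokens
    return ((left + 0.5) / num_bins * width,
            (top + 0.5) / num_bins * height,
            (right + 0.5) / num_bins * width,
            (bottom + 0.5) / num_bins * height)


def decode_box(score_fn: Callable[[Sequence[float], List[int]], Sequence[float]],
               cell_repr: Sequence[float],
               num_bins: int = NUM_BINS) -> List[int]:
    """Greedily decode left, top, right, bottom one token at a time.

    Each step sees the tokens decoded so far, so later coordinates can
    depend on earlier ones.
    """
    tokens: List[int] = []
    for _ in range(4):
        scores = score_fn(cell_repr, tokens)  # one score per bin
        tokens.append(max(range(num_bins), key=lambda i: scores[i]))
    return tokens
```

In this sketch the decoder would be invoked once per non-empty cell, with `cell_repr` taken from the logical structure decoder. Empty cells would not trigger it.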