Tables are widely used, in a variety of structures, to organize and present data. Recent work on table understanding focuses mainly on relational tables and overlooks other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noting that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Building on this structure, we propose tree-based attention and position embedding to better capture the spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of unlabeled web and spreadsheet tables and fine-tune it on two critical tasks in the field of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art results on five widely studied datasets.
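As a rough illustration of how a bi-dimensional coordinate tree could feed tree-based attention, the sketch below gives each cell a pair of root-to-node paths, one in a top header tree and one in a left header tree, and lets two cells attend to each other only when they are close in both trees. The abstract does not spell out the distance definition or the masking rule, so the names `path_distance`, `cell_distance`, `attention_mask`, and the `max_distance` threshold are all assumptions made for this example, not TUTA's actual implementation.

```python
from typing import List, Tuple

# Assumption: a node is identified by the child indices on the path from the tree root.
Path = Tuple[int, ...]
# Assumption: a cell carries one path in the top header tree and one in the left header tree.
Cell = Tuple[Path, Path]


def path_distance(a: Path, b: Path) -> int:
    """Number of edges between two nodes of the same tree, given their root-to-node paths."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Walk up from each node to the lowest common ancestor.
    return (len(a) - common) + (len(b) - common)


def cell_distance(c1: Cell, c2: Cell) -> int:
    """Illustrative bi-dimensional distance: top-tree distance plus left-tree distance."""
    (top1, left1), (top2, left2) = c1, c2
    return path_distance(top1, top2) + path_distance(left1, left2)


def attention_mask(cells: List[Cell], max_distance: int) -> List[List[bool]]:
    """mask[i][j] is True when cell i may attend to cell j under the hypothetical threshold."""
    return [[cell_distance(a, b) <= max_distance for b in cells] for a in cells]


# Three cells: the first two share a parent top header and the same left header,
# the third sits under different headers in both trees.
cells: List[Cell] = [
    ((0, 0), (0,)),
    ((0, 1), (0,)),
    ((1, 0), (1,)),
]
mask = attention_mask(cells, max_distance=2)
```

With this toy input, the first two cells can attend to each other while the third attends only to itself. In a transformer, a Boolean matrix of this kind would typically be converted into an additive mask on the attention logits.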