Since a vast number of tables can be easily collected from web pages, spreadsheets, PDFs, and various other document types, a flurry of table pre-training frameworks have been proposed following the success of pre-training on text and images, and they have achieved new state-of-the-art results on various tasks such as table question answering, table type recognition, column relation classification, table search, and formula prediction. To fully exploit the supervision signals in unlabeled tables, a variety of pre-training objectives have been designed and evaluated, for example, denoising cell values, predicting numerical relationships, and implicitly executing SQL queries. To best leverage the characteristics of (semi-)structured tables, various tabular language models, particularly those with specially designed attention mechanisms, have been explored. Since tables usually appear alongside and interact with free-form text, table pre-training usually takes the form of table-text joint pre-training, which has attracted significant research interest from multiple domains. This survey aims to provide a comprehensive review of model designs, pre-training objectives, and downstream tasks for table pre-training, and we further share our thoughts on existing challenges and future opportunities.