The information in tables can be an important complement to text, making table-based question answering (QA) systems of great value. The intrinsic complexity of handling tables often adds an extra burden to both model design and data annotation. In this paper, we aim to develop a simple table-based QA model with minimal annotation effort. Motivated by the fact that table-based QA requires both alignment between questions and tables and the ability to perform complicated reasoning over multiple table elements, we propose an omnivorous pretraining approach that consumes both natural and synthetic data to endow models with these respective abilities. Specifically, given freely available tables, we leverage retrieval to pair them with relevant natural sentences for mask-based pretraining, and synthesize NL questions by converting SQL sampled from tables for pretraining with a QA loss. We perform extensive experiments in both few-shot and full settings, and the results clearly demonstrate the superiority of our model OmniTab, with the best multitasking approach achieving absolute gains of 16.2% and 2.7% in the 128-shot and full settings respectively, also establishing a new state-of-the-art on WikiTableQuestions. Detailed ablations and analyses reveal different characteristics of natural and synthetic data, shedding light on future directions in omnivorous pretraining. Code, pretraining data, and pretrained models are available at https://github.com/jzbjyb/OmniTab.
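To make the two pretraining signals concrete, the sketch below builds toy versions of both example types: a retrieved natural sentence that is masked and concatenated with its linearized table (natural data), and a templated question derived from a sampled SQL-style query together with its execution result (synthetic data). All function names, the table linearization format, and the question template here are illustrative assumptions, not the actual OmniTab pipeline, which relies on retrieval over a large corpus and a learned SQL-to-NL converter.

```python
# Minimal, illustrative sketch (not the authors' implementation) of the two
# pretraining example types: (1) natural sentences paired with tables for
# mask-based pretraining, and (2) synthetic NL questions converted from
# SQL-like queries for pretraining with a QA loss.

import random

MASK = "<mask>"

def flatten_table(header, rows):
    """Linearize a table into a flat string, a common encoding for
    feeding tables to a seq2seq model (format here is hypothetical)."""
    out = ["col : " + " | ".join(header)]
    for i, row in enumerate(rows, 1):
        out.append(f"row {i} : " + " | ".join(map(str, row)))
    return " ".join(out)

def natural_mlm_example(sentence, header, rows, mask_prob=0.15):
    """Natural data: pair a retrieved sentence with its table and mask
    sentence tokens; recovering the masks teaches sentence-table alignment."""
    tokens, targets = [], []
    for tok in sentence.split():
        if random.random() < mask_prob:
            tokens.append(MASK)
            targets.append(tok)
        else:
            tokens.append(tok)
    return " ".join(tokens) + " " + flatten_table(header, rows), targets

def synthetic_qa_example(header, rows):
    """Synthetic data: sample a trivial SELECT-col-WHERE-col=val query,
    render it as a templated NL question, and execute it over the table
    to obtain the gold answer for the QA loss."""
    sel = random.randrange(len(header))
    cond = random.randrange(len(header))
    pivot = random.choice(rows)
    question = f"what is the {header[sel]} when {header[cond]} is {pivot[cond]}?"
    answer = [r[sel] for r in rows if r[cond] == pivot[cond]]
    return question + " " + flatten_table(header, rows), answer

if __name__ == "__main__":
    header = ["city", "country", "population"]
    rows = [["Paris", "France", "2.1M"], ["Lyon", "France", "0.5M"]]
    print(natural_mlm_example("Paris is the capital of France .", header, rows))
    print(synthetic_qa_example(header, rows))
```

In this toy setup, both example types share the same table linearization, so a single encoder-decoder model can be pretrained on a mixture of the two losses, mirroring the multitasking setup described above.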