Existing question answering systems mainly focus on text data. However, much of the data produced daily is stored as tables, which are found in documents, in relational databases, and on the web. Many datasets for table question answering exist in English, but few exist in Korean. In this paper, we describe how we construct two Korean-specific datasets for table question answering. The Korean tabular dataset is a collection of 1.4M tables with corresponding descriptions, intended for unsupervised pre-training of language models. The Korean table question answering corpus consists of 70k question-answer pairs created by crowd-sourced workers. We then build a Transformer-based pre-trained language model, fine-tune it for table question answering with these datasets, and report the evaluation results. We make our datasets publicly available via our GitHub repository and hope that they will support further studies on question answering over tables and on the transformation of table formats.