Making Table Understanding Work in Practice

Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim to detect a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks, with excellent accuracy on benchmarks. However, we observe a gap between the performance of these models on benchmarks and their applicability in practice. In this paper, we address the question: what do we need for these models to work in practice? We discuss three challenges of deploying table understanding models and propose a framework to address them. These challenges are 1) the difficulty of customizing models to specific domains, 2) the lack of training data for the typical database tables found in enterprises, and 3) the lack of confidence in the inferences made by models. We present SigmaTyper, which implements this framework for the semantic column type detection task. SigmaTyper encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model. Lastly, we highlight avenues for future research that would further close the gap towards making table understanding effective in practice.
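The abstract does not describe SigmaTyper's internals, so the following is only a minimal sketch of what a hybrid column type detector with a human-in-the-loop step might look like. Everything in it is an assumption made for illustration: the class `HybridColumnTyper`, the header lookup, the regular-expression patterns that stand in for a learned model, and the 0.8 confidence threshold. It shows the three ideas named above in miniature: a domain-specific lookup that users can extend, a fallback predictor, and abstention when confidence is low.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Prediction:
    semantic_type: str
    confidence: float
    source: str  # which step produced the prediction


@dataclass
class HybridColumnTyper:
    """Hypothetical sketch, not SigmaTyper's actual implementation."""

    # Assumed header-to-type lookup; user feedback is stored here.
    header_lookup: dict = field(
        default_factory=lambda: {"email": "email", "zip": "postal_code"}
    )
    # Assumed value patterns standing in for a trained model.
    value_patterns: dict = field(
        default_factory=lambda: {
            "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
            "postal_code": re.compile(r"\d{5}"),
        }
    )
    # Below this share of matching values, the typer abstains.
    threshold: float = 0.8

    def predict(self, header, values):
        # Step 1: exact header match, including types learned from feedback.
        key = header.strip().lower()
        if key in self.header_lookup:
            return Prediction(self.header_lookup[key], 1.0, "header")

        # Step 2: score each candidate type by the share of matching values.
        best = Prediction("unknown", 0.0, "values")
        non_empty = [v for v in values if v]
        if non_empty:
            for name, pattern in self.value_patterns.items():
                matches = sum(bool(pattern.fullmatch(v)) for v in non_empty)
                share = matches / len(non_empty)
                if share > best.confidence:
                    best = Prediction(name, share, "values")

        # Step 3: abstain rather than report a low-confidence type.
        if best.confidence < self.threshold:
            return Prediction("unknown", best.confidence, "abstained")
        return best

    def add_feedback(self, header, semantic_type):
        # Human-in-the-loop step: record a user's correction for this header.
        self.header_lookup[header.strip().lower()] = semantic_type


typer = HybridColumnTyper()
first = typer.predict("cust_ref", ["A-17", "B-22"])   # abstains
typer.add_feedback("cust_ref", "customer_id")         # user supplies the type
second = typer.predict("cust_ref", ["A-17", "B-22"])  # now resolved by header
```

In this sketch the correction is stored as a lookup entry rather than used to retrain anything, which keeps the feedback step cheap; a real system could instead use such corrections as additional training data.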
