Many organizations rely on data from government and third-party sources, and these sources rarely follow the same data formats. This introduces challenges when integrating data from multiple sources. Commercial database systems do not offer adequate support for integrating data from heterogeneous sources, and manual integration is both time-consuming and inefficient. State-of-the-art approaches rely on similarity functions and textual transformations, but they often fail on challenging cases where multiple mappings are required or the mappings go beyond simple textual transformations. In this paper, we study the potential of deep neural models for transforming tables for joinability. In particular, we cast the problem as a prediction task and develop a framework that leverages large deep-learning language models to transform tabular data from a source format into a desired target representation. Our framework can efficiently learn the pattern for mapping the source format into the expected target from just a few examples, which can then be used for table joining, filling in missing values, and error detection. Compared to state-of-the-art mapping and joining approaches, our framework delivers noticeably more accurate and scalable performance on both real-world and synthetic datasets. Our experimental evaluation also shows that, despite the significant difference in size, our fine-tuned model performs on par with or better than large language models such as GPT-3, and that integrating large language models into our framework further improves their performance.
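To make the prediction-task framing concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of how a few source-to-target example pairs could be serialized into a prompt for a language model, which is then asked to complete the target representation for a new source value:

```python
def build_fewshot_prompt(examples, source_value):
    """Serialize a few (source, target) pairs, then append the new
    source value so a language model can predict its target form.
    This is an illustrative sketch; names and format are assumptions."""
    lines = [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{source_value} =>")
    return "\n".join(lines)

# Hypothetical example: mapping full names to a "Last, Initial." format
# so two tables keyed on differently formatted names become joinable.
examples = [("John Smith", "Smith, J."), ("Mary Jones", "Jones, M.")]
prompt = build_fewshot_prompt(examples, "Alan Turing")
print(prompt)
```

The model's completion for the final line would supply the transformed value; the same learned mapping can also be applied to fill missing values or flag entries that deviate from the predicted pattern.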