As organizations continue to access diverse datasets, the demand for effective data integration has increased. Key tasks in this process, such as schema matching and entity resolution, are essential but often require significant manual effort. Although previous studies have aimed to automate these tasks, the influence of dataset characteristics on matching effectiveness has not been thoroughly examined, and combinations of different methods remain limited. This study introduces a contextual graph embedding technique that integrates structural details from tabular data with contextual elements such as column descriptions and external knowledge. Experiments on datasets with varying properties (domain specificity, data size, missing rate, and overlap rate) showed that our approach consistently surpassed existing graph-based methods, especially in difficult scenarios such as those with a high proportion of numerical values or significant missing data. However, we identified specific failure cases, such as columns that are semantically similar but distinct, which remain a challenge for our method. The study highlights two main insights: (i) contextual embeddings enhance matching reliability, and (ii) dataset characteristics significantly affect integration outcomes. These contributions can advance the development of practical data integration systems that support real-world enterprise applications.