Multi-source entity linkage focuses on integrating knowledge from multiple sources by linking the records that represent the same real-world entity. This is critical in high-impact applications such as data cleaning and user stitching. State-of-the-art entity linkage pipelines mainly depend on supervised learning, which requires large amounts of labeled training data. However, collecting well-labeled training data becomes expensive when data from many sources arrives incrementally over time. Moreover, the trained models can easily overfit to specific data sources and thus fail to generalize to new sources due to significant differences in data and label distributions. To address these challenges, we present AdaMEL, a deep transfer-learning framework that learns generic high-level knowledge to perform multi-source entity linkage. AdaMEL models the attribute importance used to match entities through an attribute-level self-attention mechanism, and leverages massive unlabeled data from new data sources through domain adaptation to make the model generic and data-source agnostic. In addition, AdaMEL can incorporate an additional set of labeled data to more accurately integrate data sources with different attribute importance. Extensive experiments show that our framework achieves state-of-the-art results, with an 8.21% improvement on average over methods based on supervised learning. Moreover, it is more stable across different sets of data sources while requiring less runtime.
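The attribute-level attention idea above can be illustrated with a minimal sketch: per-attribute similarities between two records are combined into a match score using softmax-normalized importance weights. This is not AdaMEL's actual architecture; the function name, the fixed weights, and the example attributes (name, email, phone) are hypothetical, and in the real framework the weights would be learned and adapted to new sources.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def attribute_attention_score(attr_sims, w):
    """Score a candidate record pair from per-attribute similarities.

    attr_sims: similarity features, one per attribute (hypothetical example:
               name, email, phone), each in [0, 1].
    w: attention parameters, one per attribute. In a trained model these
       would be learned; here they are fixed constants for illustration.
    """
    alpha = softmax(w)               # attribute-importance weights, sum to 1
    return float(alpha @ attr_sims)  # importance-weighted match score

# Hypothetical pair: strong name and phone agreement, weak email agreement.
sims = np.array([0.9, 0.2, 0.8])
weights = np.array([2.0, 0.5, 1.0])
score = attribute_attention_score(sims, weights)
```

Because the attention weights sum to 1 and each similarity lies in [0, 1], the score is a convex combination and also lies in [0, 1]; a downstream threshold or classifier would then decide whether the pair is a link.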