Entity Resolution (ER) aims to identify whether two tuples refer to the same real-world entity and is well known to be labor-intensive. It is a prerequisite for anomaly detection, since comparing the attribute values of two matched tuples from two different datasets is one effective way to detect anomalies. Existing ER approaches cannot achieve stable performance, owing to insufficient feature discovery or inherently error-prone characteristics. In this paper, we present CollaborER, a self-supervised entity resolution framework via multi-features collaboration. It is capable of (i) obtaining reliable ER results with zero human annotations and (ii) discovering adequate tuple features in a fault-tolerant manner. CollaborER consists of two phases: automatic label generation (ALG) and collaborative ER training (CERT). In the first phase, ALG generates a set of positive tuple pairs and a set of negative tuple pairs. ALG guarantees the high quality of the generated tuple pairs and hence ensures the training quality of the subsequent CERT. In the second phase, CERT learns the matching signals by discovering graph features and sentence features of tuples collaboratively. Extensive experimental results over eight real-world ER benchmarks show that CollaborER outperforms all the existing unsupervised ER approaches and is comparable or even superior to the state-of-the-art supervised ER methods.
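To make the idea of annotation-free label generation concrete, the following is a minimal sketch of how positive and negative tuple pairs could be produced without human labels. It is not the ALG procedure of CollaborER, which the abstract does not specify. The sketch assumes a simple token-level Jaccard similarity in place of learned tuple representations, treats mutual nearest neighbours above a threshold as positives, and treats pairs at or below a low threshold as negatives. The function names, thresholds, and sample records are all hypothetical.

```python
from itertools import product


def tokens(record: dict) -> set:
    """Lower-cased whitespace tokens over all attribute values of a tuple."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; an assumed stand-in for a learned tuple similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_labels(left, right, pos_threshold=0.5, neg_threshold=0.1):
    """Return (positives, negatives) as lists of (left_index, right_index).

    Assumed heuristic, not the paper's method: positives are mutual nearest
    neighbours whose similarity reaches pos_threshold; negatives are all
    cross-table pairs whose similarity is at or below neg_threshold.
    """
    if not left or not right:
        return [], []

    left_tokens = [tokens(r) for r in left]
    right_tokens = [tokens(r) for r in right]
    sim = [[jaccard(a, b) for b in right_tokens] for a in left_tokens]

    # Nearest neighbour of each tuple in the other table.
    best_right = [max(range(len(right)), key=lambda j: sim[i][j])
                  for i in range(len(left))]
    best_left = [max(range(len(left)), key=lambda i: sim[i][j])
                 for j in range(len(right))]

    positives = [(i, j) for i, j in enumerate(best_right)
                 if best_left[j] == i and sim[i][j] >= pos_threshold]
    negatives = [(i, j) for i, j in product(range(len(left)), range(len(right)))
                 if sim[i][j] <= neg_threshold]
    return positives, negatives


# Hypothetical sample tables from two different sources.
left = [
    {"title": "apple iphone 12 64gb black", "brand": "apple"},
    {"title": "samsung galaxy s21 128gb", "brand": "samsung"},
]
right = [
    {"title": "iphone 12 black 64gb", "brand": "apple"},
    {"title": "galaxy s21 128gb phantom gray", "brand": "samsung"},
    {"title": "sony wh-1000xm4 headphones", "brand": "sony"},
]

positives, negatives = generate_labels(left, right)
print(positives)  # [(0, 0), (1, 1)]
print(negatives)  # [(0, 1), (0, 2), (1, 0), (1, 2)]
```

In a two-phase pipeline of the kind described above, pairs like these would serve as the training set for the matcher in the second phase. Requiring mutual nearest neighbours is one conservative way to keep the pseudo-labels clean, at the cost of producing fewer positives.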