Accurately identifying different representations of the same real-world entity is an integral part of data cleaning, and many methods have been proposed to accomplish it. This entity resolution task continues to demand research attention largely because the process is task-specific and user-dependent. Deep learning techniques have the potential to lessen these challenges. In this paper, we devise an entity resolution method that builds on the robustness conferred by deep autoencoders to reduce the cost of human involvement. Specifically, we reduce the cost of training deep entity resolution models by performing unsupervised representation learning. This reveals a transferability property of the resulting model, which can further reduce the cost of applying the approach to new datasets by means of transfer learning. Finally, we reduce the cost of labelling training data through an active learning approach that builds on the properties conferred by the use of deep autoencoders. Empirical evaluation confirms that the approach meets these cost-reduction goals while achieving effectiveness comparable to state-of-the-art alternatives.