Given two large lists of records, the task in entity resolution (ER) is to find the pairs in the Cartesian product of the lists that correspond to the same real-world entity. Passive learning methods for tasks like ER typically require large amounts of labeled data to yield useful models. Active learning is a promising approach for ER in low-resource settings. However, for instance-pair tasks, the search space of candidate samples for the user to label grows quadratically, which makes active learning hard to scale. Previous work in this setting relies on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune unlikely pairs from the Cartesian product. This blocking step can miss important regions of the product space, leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, in which each committee member learns representations based on powerful transformer models. We highlight surprising differences between the matcher and the blocker in how their training data is created and in the objectives used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall, and running time. Code is available at https://github.com/ArjitJ/DIAL
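To make the blocking step concrete, the following minimal sketch shows how embedding-based blocking prunes the Cartesian product and how blocking recall can be measured. It is an illustration only, not DIAL's implementation: the function names `block_top_k` and `blocking_recall`, the toy two-dimensional embeddings, and the ground-truth matches are all assumptions made for this example, and the brute-force similarity scan stands in for the approximate nearest-neighbor index that a scalable system would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def block_top_k(left, right, k):
    """For each left record, keep only its k most similar right records.

    Returns a set of (left_index, right_index) candidate pairs.
    Hypothetical helper: a brute-force scan, used here for clarity.
    """
    candidates = set()
    for i, u in enumerate(left):
        ranked = sorted(
            range(len(right)),
            key=lambda j: cosine(u, right[j]),
            reverse=True,
        )
        candidates.update((i, j) for j in ranked[:k])
    return candidates


def blocking_recall(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


# Toy, made-up embeddings for two small record lists.
left = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
right = [(0.9, 0.1), (0.1, 0.9), (0.7, 0.7), (-1.0, 0.0)]
true_matches = {(0, 0), (1, 1), (2, 2)}  # assumed ground truth

candidates = block_top_k(left, right, k=1)
total_pairs = len(left) * len(right)

print(f"kept {len(candidates)} of {total_pairs} pairs")
print(f"blocking recall: {blocking_recall(candidates, true_matches):.2f}")
```

With `k=1`, only one candidate per left record is passed to the matcher instead of every pair in the product. If the embeddings place a true match far from its counterpart, that pair is pruned and recall drops, which is the failure mode that motivates learning the blocking embeddings rather than fixing them in advance.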