Named Entity Disambiguation is the Natural Language Processing task of identifying textual records that correspond to the same Named Entity, i.e. a real-world entity represented as a list of attributes (names, places, organisations, etc.). In this work, we address the task of disambiguating companies on the basis of their written names. We propose a Siamese LSTM Network approach that extracts, via supervised learning, an embedding of company name strings in a (relatively) low-dimensional vector space, and we use this representation to identify pairs of company names that actually represent the same company (i.e. the same Entity). Given that manually labelling string pairs is rather onerous, we analyse how an Active Learning approach that prioritises the samples to be labelled leads to a more efficient overall learning pipeline. Through empirical investigations, we show that our proposed Siamese Network outperforms several benchmark approaches based on standard string matching algorithms when enough labelled data are available. Moreover, we show that Active Learning prioritisation is indeed helpful when labelling resources are limited: it lets the learning models reach out-of-sample performance saturation with less labelled data than standard (random) data labelling approaches.
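To make the prioritisation idea concrete, the following is a minimal sketch of uncertainty-based sample selection on top of a string matching baseline. It is an illustration under stated assumptions, not the method evaluated in this work: the abstract does not specify the query strategy or the benchmark algorithms, so the use of `difflib.SequenceMatcher`, the helper names `similarity` and `prioritise`, the `threshold` value, and the example company names are all hypothetical. With a learned model, the score would presumably be derived from the distance between the Siamese embeddings instead.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Baseline string similarity in [0, 1], ignoring case (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prioritise(pairs, score=similarity, threshold=0.5):
    """Order unlabelled pairs so that the most uncertain ones come first.

    Uncertainty is measured as the distance of the score from the decision
    threshold (an assumed uncertainty-sampling rule): pairs scoring near the
    threshold are the ones the current scorer is least sure about.
    """
    return sorted(pairs, key=lambda p: abs(score(*p) - threshold))


# Hypothetical unlabelled pairs of company names.
pairs = [
    ("Acme Corp", "Acme Corp"),
    ("Acme Corp", "ACME Corporation"),
    ("Acme Corp", "Zenith Bank"),
]

# Labelling queue: the ambiguous pair is sent to the annotator first,
# the trivially identical pair last.
queue = prioritise(pairs)
```

Labelling the pairs in the order given by `queue` spends annotation effort on borderline cases first, which is the intuition behind reaching performance saturation with fewer labels than random selection.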