Successful machine-learning-based Named Entity Recognition (NER) models can fail on texts from special domains, such as Chinese addresses and e-commerce titles, which require adequate background knowledge. Such texts are also difficult for human annotators. However, we can obtain potentially helpful information from correlated texts, which share common entities, to aid text understanding: one can then easily reason out the correct answer by referencing correlated samples. In this paper, we propose enhancing NER models with correlated samples. We retrieve correlated samples with the sparse BM25 retriever from large-scale in-domain unlabeled data. To explicitly simulate the human reasoning process, we perform training-free entity type calibration by majority voting. To capture correlation features in the training stage, we propose modeling correlated samples with a transformer-based multi-instance cross-encoder. Empirical results on datasets from the above two domains show the efficacy of our methods.
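The training-free calibration step can be sketched as a simple majority vote over the entity types predicted for the same mention in the retrieved correlated samples. This is a minimal sketch: the function name and interface are hypothetical, and tie-breaking follows `Counter` ordering rather than any rule specified by the paper.

```python
from collections import Counter

def calibrate_entity_type(predicted_type, correlated_predictions):
    """Training-free calibration (hypothetical interface): override the
    model's predicted type for an entity mention with the majority vote
    over the types predicted for the same mention in correlated samples
    retrieved by BM25."""
    votes = Counter(correlated_predictions)
    votes[predicted_type] += 1  # the original prediction also casts a vote
    majority_type, _ = votes.most_common(1)[0]
    return majority_type

# Three correlated samples voting "LOC" outweigh the model's own "ORG".
print(calibrate_entity_type("ORG", ["LOC", "LOC", "LOC"]))  # -> LOC
```

With no correlated samples retrieved, the vote degenerates to the model's original prediction, so the calibration never loses information.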