Entity coreference resolution is an important research problem with many applications, including information extraction and question answering. Coreference resolution for English has been studied extensively, but there is relatively little work on other languages. A problem that frequently arises when working with a non-English language is the scarcity of annotated training data. To overcome this challenge, we design a simple but effective ensemble-based framework that combines various transfer learning (TL) techniques. We first train several models using different TL methods. Then, during inference, we compute the unweighted average of the models' prediction scores to extract the final set of predicted clusters. Furthermore, we propose a low-cost TL method that bootstraps coreference resolution models using Wikipedia anchor texts. Leveraging the observation that coreferential links naturally exist between anchor texts pointing to the same article, our method builds a sizeable distantly supervised dataset for the target language consisting of tens of thousands of documents. We can pre-train a model on this pseudo-labeled dataset before fine-tuning it on the final target dataset. Experimental results on two benchmark datasets, OntoNotes and SemEval, confirm the effectiveness of our methods. Our best ensembles consistently outperform the baseline approach of simple training by up to 7.68% in F1 score. These ensembles also achieve new state-of-the-art results for three languages: Arabic, Dutch, and Spanish.
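The unweighted score averaging described above can be sketched as follows. This is a minimal illustration, not the paper's exact decoding procedure: it assumes each ensemble member exposes a mention-pair antecedent score matrix, and uses a simple greedy best-antecedent linking rule with union-find to form clusters. The function names, matrix layout, and threshold are illustrative assumptions.

```python
import numpy as np


def ensemble_antecedent_scores(score_matrices):
    """Unweighted average of per-model antecedent score matrices.

    score_matrices: list of (num_mentions, num_mentions) arrays, where
    entry [i, j] is one model's score that mention j (j < i) is an
    antecedent of mention i.  (Illustrative layout, not the paper's.)
    """
    return np.mean(np.stack(score_matrices), axis=0)


def extract_clusters(avg_scores, threshold=0.0):
    """Greedy cluster extraction: link each mention to its best-scoring
    earlier antecedent when that averaged score exceeds the threshold,
    then read clusters off a union-find structure."""
    n = avg_scores.shape[0]
    parent = list(range(n))

    def find(x):
        # Path-halving union-find lookup.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(1, n):
        j = int(np.argmax(avg_scores[i, :i]))
        if avg_scores[i, j] > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for m in range(n):
        clusters.setdefault(find(m), []).append(m)
    # Singleton "clusters" carry no coreference information; drop them.
    return [c for c in clusters.values() if len(c) > 1]
```

Averaging raw scores rather than hard cluster decisions lets models trained with different TL methods contribute graded evidence for each candidate link before any discrete clustering decision is made.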
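The anchor-text idea can be illustrated with a small sketch: anchors that link to the same Wikipedia article are treated as mentions of the same entity, yielding pseudo coreference clusters without manual annotation. The wikitext snippet, regex, and function below are simplified assumptions for illustration; a real pipeline would parse full article dumps and handle many more link forms.

```python
import re
from collections import defaultdict

# Hypothetical wikitext snippet with [[Target|surface]] and [[Target]] links.
WIKITEXT = (
    "[[Barack Obama|Obama]] was born in Hawaii. "
    "[[Barack Obama|He]] later served as president. "
    "[[Hawaii]] is a US state."
)

# Matches [[Target]] or [[Target|surface text]].
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def anchor_clusters(wikitext):
    """Strip wiki links down to their surface text and group the anchor
    spans by target article: anchors pointing to the same article are
    treated as coreferent (distant supervision)."""
    clusters = defaultdict(list)
    out, pos = [], 0
    for m in LINK_RE.finditer(wikitext):
        out.append(wikitext[pos:m.start()])
        start = sum(len(s) for s in out)
        surface = m.group(2) or m.group(1)  # fall back to the title itself
        out.append(surface)
        clusters[m.group(1)].append((start, start + len(surface)))
        pos = m.end()
    out.append(wikitext[pos:])
    text = "".join(out)
    # Only targets linked more than once yield a coreference cluster.
    return text, {t: spans for t, spans in clusters.items() if len(spans) > 1}
```

Applying this over a Wikipedia dump produces the kind of large pseudo-labeled corpus the abstract describes, on which a model can be pre-trained before fine-tuning on the gold target dataset.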