Distant supervision (DS) is a well-established technique for creating large-scale datasets for relation extraction (RE) without human annotation. However, research in DS-RE has been mostly limited to English. Constraining RE to a single language prevents the use of large amounts of data in other languages, which could allow extraction of more diverse facts. Very recently, a dataset for multilingual DS-RE was released. However, our analysis reveals that this dataset exhibits unrealistic characteristics: 1) it lacks sentences that do not express any relation, and 2) all sentences for a given entity pair express exactly one relation. We show that these characteristics lead to a gross overestimation of model performance. In response, we propose a new dataset, DiS-ReX, which alleviates these issues. Our dataset has more than 1.5 million sentences spanning 4 languages, with 36 relation classes plus 1 no-relation (NA) class. We also adapt the widely used bag attention models by encoding sentences with mBERT and provide the first benchmark results on multilingual DS-RE. Unlike the competing dataset, our dataset is challenging and leaves ample room for future research in this field.
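The two characteristics named above can be checked on any DS-RE corpus with a few lines of code. The following is a minimal sketch, not the authors' analysis script: the tuple layout `(head, tail, relation, sentence)`, the `"NA"` label string, and the function name `bag_statistics` are assumptions made for illustration.

```python
from collections import defaultdict

NA = "NA"  # assumed label string for the "no relation" class


def bag_statistics(instances):
    """Summarise a DS-RE corpus.

    `instances` is an iterable of (head, tail, relation, sentence) tuples;
    this schema is hypothetical and not taken from any released dataset.
    """
    relations_by_pair = defaultdict(set)
    total = 0
    na_count = 0
    for head, tail, relation, _sentence in instances:
        total += 1
        if relation == NA:
            na_count += 1
        else:
            # Collect the distinct non-NA relations seen for each entity pair.
            relations_by_pair[(head, tail)].add(relation)

    # Entity pairs whose sentences carry more than one distinct relation.
    multi = sum(1 for rels in relations_by_pair.values() if len(rels) > 1)
    pairs = len(relations_by_pair)
    return {
        "na_sentence_fraction": na_count / total if total else 0.0,
        "multi_relation_pair_fraction": multi / pairs if pairs else 0.0,
    }
```

On a corpus with the two characteristics described above, both fractions would come out as zero; a more realistic corpus would be expected to show non-zero values for both.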