DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction

Distant supervision (DS) is a well-established technique for creating large-scale datasets for relation extraction (RE) without human annotation. However, research in DS-RE has mostly been limited to English. Constraining RE to a single language prevents the use of large amounts of data in other languages, which could allow extraction of more diverse facts. Very recently, a dataset for multilingual DS-RE was released. However, our analysis reveals that this dataset exhibits unrealistic characteristics: 1) it lacks sentences that do not express any relation, and 2) all sentences for a given entity pair express exactly one relation. We show that these characteristics lead to a gross overestimation of model performance. In response, we propose a new dataset, DiS-ReX, which alleviates these issues. Our dataset has more than 1.5 million sentences spanning 4 languages, with 36 relation classes plus one no-relation (NA) class. We also modify the widely used bag attention models by encoding sentences with mBERT, and provide the first benchmark results on multilingual DS-RE. Unlike the competing dataset, we show that our dataset is challenging and leaves ample room for future research in this field.
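The two characteristics named in the abstract can be checked on any DS-RE corpus with a few lines of code. The following is a minimal sketch, not the authors' analysis script: it assumes each sentence is available as a hypothetical `(head, tail, relation)` triple and that the no-relation class is labeled `"NA"`; the actual DiS-ReX file format may differ.

```python
from collections import defaultdict


def dataset_diagnostics(instances, na_label="NA"):
    """Compute two simple statistics over sentence-level DS-RE instances.

    `instances` is an iterable of (head, tail, relation) triples, one per
    sentence. This layout is an assumption made for illustration only.
    """
    total = 0
    na_count = 0
    # Non-NA relation labels observed for each entity pair.
    relations_per_pair = defaultdict(set)

    for head, tail, relation in instances:
        total += 1
        if relation == na_label:
            na_count += 1
        else:
            relations_per_pair[(head, tail)].add(relation)

    # Entity pairs whose sentences carry more than one distinct relation.
    multi = sum(1 for rels in relations_per_pair.values() if len(rels) > 1)
    n_pairs = len(relations_per_pair)

    return {
        "na_fraction": na_count / total if total else 0.0,
        "multi_relation_pair_fraction": multi / n_pairs if n_pairs else 0.0,
    }


# Toy data, invented for this example.
toy = [
    ("Paris", "France", "capital_of"),
    ("Paris", "France", "located_in"),
    ("Berlin", "Germany", "capital_of"),
    ("Obama", "Hawaii", "NA"),
]
stats = dataset_diagnostics(toy)
```

A corpus with the issues described above would report both statistics as zero: no NA sentences, and no entity pair associated with more than one relation.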

