Distant supervision automatically generates a large quantity of training samples for relation extraction. However, it also incurs two major problems: noisy labels and imbalanced training data. Previous works focus more on reducing wrongly labeled relations (false positives), while few explore the missing relations caused by the incompleteness of the knowledge base (false negatives). Furthermore, in previous problem formulations, the number of negative labels overwhelmingly surpasses that of positive ones. In this paper, we first provide a thorough analysis of the above challenges caused by negative data. Second, we formulate relation extraction as a positive-unlabeled learning task to alleviate the false negative problem. Third, we propose a pipeline approach, dubbed \textsc{ReRe}, that performs sentence-level relation detection followed by subject/object extraction to achieve sample-efficient training. Experimental results show that the proposed method consistently outperforms existing approaches and maintains excellent performance even when trained with a large quantity of false positive samples.
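To make the two-stage pipeline concrete, the following is a minimal, hypothetical sketch of sentence-level relation detection followed by subject/object extraction. The function names and the keyword-based stand-in classifiers are illustrative assumptions, not the models used in \textsc{ReRe}.

\begin{verbatim}
# Hypothetical sketch of a detect-then-extract pipeline.
# Stage 1 predicts which relations a sentence expresses;
# Stage 2 extracts subject/object spans for each detected relation.
from typing import List, Tuple

def detect_relations(sentence: str) -> List[str]:
    """Stage 1: sentence-level relation detection (placeholder rules)."""
    cues = {"founded_by": "founded", "capital_of": "capital"}
    return [rel for rel, cue in cues.items() if cue in sentence.lower()]

def extract_subject_object(sentence: str, relation: str) -> Tuple[str, str]:
    """Stage 2: subject/object extraction conditioned on the relation
    (a real system would run a span tagger here)."""
    tokens = sentence.rstrip(".").split()
    return tokens[0], tokens[-1]

def extract_triples(sentence: str) -> List[Tuple[str, str, str]]:
    """Run detection first, then extraction per detected relation."""
    return [(subj, rel, obj)
            for rel in detect_relations(sentence)
            for subj, obj in [extract_subject_object(sentence, rel)]]

if __name__ == "__main__":
    print(extract_triples("Microsoft was founded by Bill Gates."))
\end{verbatim}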