Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate this problem, we study whether a small clean dataset can help improve the quality of distantly supervised models. We show that, besides providing a more convincing evaluation of models, a small clean dataset also helps us build more robust denoising models. Specifically, we propose a new criterion for clean instance selection based on influence functions. It collects sample-level evidence for recognizing good instances, which is more informative than loss-level evidence. We also propose a teacher-student mechanism for controlling the purity of intermediate results when bootstrapping the clean set. The whole approach is model-agnostic and demonstrates strong performance in denoising both real (NYT) and synthetic noisy datasets.
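To make the idea of influence-based instance selection concrete, the following is a minimal sketch under stated assumptions, not the criterion or the teacher-student mechanism proposed in this work. It swaps in a plain logistic-regression classifier on synthetic data and the classic first-order influence approximation: the effect of upweighting a training instance on the loss over a small clean set is estimated as the negative product of the clean-set gradient, the inverse Hessian, and the instance gradient. All names (`fit_logreg`, `influence_scores`, `demo`) and all constants (noise rate, set sizes, regularization strength) are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logreg(X, y, l2=1e-2, steps=50):
    """Fit L2-regularized logistic regression with Newton's method."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + l2 * w
        hess = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
        w -= np.linalg.solve(hess, grad)
    return w


def influence_scores(X_train, y_train, X_clean, y_clean, w, l2=1e-2):
    """Estimate how upweighting each training instance changes the clean-set loss.

    A negative score means upweighting the instance is estimated to lower the
    clean-set loss, so the instance looks consistent with the clean data.
    """
    n, d = X_train.shape
    p_train = sigmoid(X_train @ w)
    hess = (X_train.T * (p_train * (1 - p_train))) @ X_train / n + l2 * np.eye(d)
    # Mean gradient of the loss on the small clean set.
    g_clean = X_clean.T @ (sigmoid(X_clean @ w) - y_clean) / len(y_clean)
    s = np.linalg.solve(hess, g_clean)  # H^{-1} g_clean
    # Per-instance loss gradients on the noisy training set, one row each.
    g_train = X_train * (p_train - y_train)[:, None]
    return -g_train @ s


def demo(seed=0, n=400, n_clean=50, d=5, flip_rate=0.3):
    """Assumed toy setup: linear ground truth with randomly flipped labels."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y_true = (X @ w_true > 0).astype(float)
    flipped = rng.random(n) < flip_rate
    y_noisy = np.where(flipped, 1.0 - y_true, y_true)

    X_clean = rng.normal(size=(n_clean, d))
    y_clean = (X_clean @ w_true > 0).astype(float)

    w = fit_logreg(X, y_noisy)
    scores = influence_scores(X, y_noisy, X_clean, y_clean, w)
    selected = np.argsort(scores)[: n // 2]  # most helpful half
    return flipped.mean(), flipped[selected].mean()


overall_noise, selected_noise = demo()
print(f"noise rate overall: {overall_noise:.2f}, among selected: {selected_noise:.2f}")
```

In this toy setting, instances whose labels were flipped tend to receive positive scores, so keeping the lowest-scoring half yields a subset with a lower noise rate than the full training set. Because the score depends on each instance's own gradient rather than only on its loss value, it reflects the sample-level evidence the abstract contrasts with loss-level evidence.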