DocRED is a widely used dataset for document-level relation extraction. To reduce the workload of its large-scale annotation, a \textit{recommend-revise} scheme was adopted: annotators are provided with candidate relation instances from distant supervision and then manually supplement and remove relational facts based on these recommendations. However, when comparing DocRED with a subset relabeled from scratch, we find that this scheme results in a considerable number of false negative samples and an obvious bias towards popular entities and relations. Furthermore, we observe that models trained on DocRED have low recall on our relabeled dataset and inherit the same bias present in the training data. Through an analysis of annotators' behavior, we identify the underlying reason for these problems: the scheme actually discourages annotators from supplementing adequate instances in the revision phase. We appeal to future research to take the issues of the recommend-revise scheme into consideration when designing new models and annotation schemes. The relabeled dataset is released at \url{https://github.com/AndrewZhe/Revisit-DocRED} to serve as a more reliable test set for document RE models.