Factual knowledge graphs (KGs) such as DBpedia and Wikidata are used in a wide range of downstream tasks and are widely adopted by the artificial intelligence research community as benchmark datasets. However, we found these KGs to be surprisingly noisy. In this study, we question their quality: we estimate the typing error rate to be 27% on average for coarse-grained types, and as high as 73% for certain fine-grained types. In pursuit of solutions, we propose an active typing error detection algorithm that maximizes the utilization of both gold and noisy labels. We also comprehensively discuss and compare unsupervised, semi-supervised, and supervised paradigms for dealing with typing errors in factual KGs. The outcomes of this study provide guidelines for researchers who use noisy factual KGs. To help practitioners deploy the techniques and conduct further research, we have published our code and data.