Wikidata has been increasingly adopted by many communities for a wide variety of applications, which demand high-quality knowledge to deliver successful results. In this paper, we develop a framework to detect and analyze low-quality statements in Wikidata by shedding light on the current practices exercised by the community. We explore three indicators of data quality in Wikidata, based on: 1) community consensus on the currently recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality; 2) statements that have been deprecated; and 3) constraint violations in the data. We combine these indicators to detect low-quality statements, revealing challenges with duplicate entities, missing triples, violated type rules, and taxonomic distinctions. Our findings complement ongoing efforts by the Wikidata community to improve data quality, aiming to make it easier for users and editors to find and correct mistakes.
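To make the three indicators concrete, the following is a minimal sketch of how they might be computed for a single statement and collected into a set of flags. It is an illustration under stated assumptions, not the framework developed in the paper: the `Statement` class, the representation of revision history as a list of triple sets, the helper names, and the choice to report the union of triggered indicators are all hypothetical. The only detail taken from Wikidata itself is that statements carry a rank of `preferred`, `normal`, or `deprecated`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Simplified stand-in for a Wikidata statement (hypothetical structure)."""
    subject: str
    prop: str
    value: str
    rank: str = "normal"      # Wikidata ranks: "preferred", "normal", "deprecated"
    violations: tuple = ()    # names of violated constraints, assumed precomputed


def removed_and_not_restored(history, triple):
    """Indicator 1: the triple appeared in an earlier revision but is
    absent from the latest one.

    `history` is assumed to be a list of sets of (subject, prop, value)
    triples, ordered from oldest to newest revision.
    """
    if not history:
        return False
    seen_earlier = any(triple in revision for revision in history[:-1])
    return seen_earlier and triple not in history[-1]


def quality_flags(stmt, history):
    """Return the set of low-quality indicators triggered by `stmt`.

    Reporting the union of indicators is an assumption of this sketch;
    the source does not specify how the indicators are combined.
    """
    triple = (stmt.subject, stmt.prop, stmt.value)
    flags = set()
    if removed_and_not_restored(history, triple):
        flags.add("removed")
    if stmt.rank == "deprecated":
        flags.add("deprecated")
    if stmt.violations:
        flags.add("constraint_violation")
    return flags


# Placeholder identifiers, not real Wikidata items or properties.
removed_stmt = Statement("Q_A", "P_x", "Q_B")
removed_history = [{("Q_A", "P_x", "Q_B")}, set()]

deprecated_stmt = Statement("Q_C", "P_y", "Q_D", rank="deprecated",
                            violations=("type",))
stable_history = [{("Q_C", "P_y", "Q_D")}, {("Q_C", "P_y", "Q_D")}]

removed_flags = quality_flags(removed_stmt, removed_history)
deprecated_flags = quality_flags(deprecated_stmt, stable_history)
```

Keeping the indicators as separate named flags, rather than collapsing them into a single score, lets a statement that triggers several indicators be distinguished from one that triggers only one.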