Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To address this gap, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on our previous work, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models that scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices that could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback, further improvement, and deployment by the research community.