Every day, millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers trace trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and to address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data, which stems from the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in six analyses across eight Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provide a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example of how clickstream-like data can enable research on user navigation on online platforms while protecting users' privacy.
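The abstract does not say how the synthetic sequences are generated from the clickstream data. As a rough illustration only, the sketch below assumes a first-order Markov chain: clickstream rows are treated as (source, target, count) triples, and each next article is sampled in proportion to its count. The function names `build_transitions` and `generate_sequence`, the article labels, and the counts are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def build_transitions(pairs):
    """Group (source, target, count) rows by source article.

    Returns a dict mapping each source to a (targets, weights) pair of
    parallel lists, ready for weighted sampling.
    """
    transitions = defaultdict(lambda: ([], []))
    for source, target, count in pairs:
        targets, weights = transitions[source]
        targets.append(target)
        weights.append(count)
    return dict(transitions)


def generate_sequence(transitions, start, max_length, rng):
    """Sample one synthetic navigation sequence as a first-order Markov walk.

    Assumption: the next article depends only on the current one, with
    probability proportional to the observed click count. The walk stops
    at max_length or when the current article has no outgoing clicks.
    """
    sequence = [start]
    current = start
    while len(sequence) < max_length and current in transitions:
        targets, weights = transitions[current]
        current = rng.choices(targets, weights=weights, k=1)[0]
        sequence.append(current)
    return sequence


# Toy clickstream rows (invented for illustration): source, target, clicks.
toy_pairs = [
    ("A", "B", 90),
    ("A", "C", 10),
    ("B", "A", 50),
    ("B", "C", 50),
]

toy_transitions = build_transitions(toy_pairs)
toy_sequence = generate_sequence(toy_transitions, "A", 5, random.Random(0))
print(toy_sequence)
```

A first-order walk of this kind discards any dependence on earlier pages in a session, which is exactly the sort of simplification whose cost the comparison between real and synthetic sequences is meant to quantify.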