Wikipedia Reader Navigation: When Synthetic Data Is Enough

Every day, millions of people read Wikipedia. When navigating the vast space of available topics using embedded hyperlinks, readers follow different trajectories in terms of the sequence of articles they visit. Understanding these navigation patterns is crucial to better serve readers' needs and to address structural biases and knowledge gaps. However, systematic studies of navigation in Wikipedia are limited because publicly available data is scarce, owing to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we address the question: how well can readers' navigation be approximated using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the difference between real navigation sequences and synthetic ones generated from the clickstream data, through 6 different experiments across 8 Wikipedia language versions. Overall, we find that these differences are statistically significant but that the effect sizes are small, often well within 10%. We thus provide quantitative evidence for the utility of the Wikipedia clickstream data as a public resource by showing that it can closely capture reader navigation on Wikipedia and constitutes a sufficient approximation for most practical downstream applications relying on data from readers. More generally, our study provides an example of how clickstream-like data can empower broader research on navigation in other online platforms while protecting users' privacy.
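
To make the idea of synthetic navigation sequences concrete, the following is a minimal sketch of one way such sequences could be sampled from clickstream-style counts. It assumes a first-order Markov random walk, which the abstract does not specify; the rows in `TOY_CLICKSTREAM` are invented for illustration, and the names `build_transitions` and `sample_sequence` are hypothetical helpers, not part of any published tooling.

```python
import random
from collections import defaultdict

# Invented toy data: (source article, target article, click count).
# Real clickstream dumps carry more fields; this shape is an assumption.
TOY_CLICKSTREAM = [
    ("A", "B", 30),
    ("A", "C", 10),
    ("B", "C", 5),
    ("C", "D", 20),
]


def build_transitions(rows):
    """Aggregate click counts into a source -> {target: count} mapping."""
    transitions = defaultdict(dict)
    for prev, curr, n in rows:
        transitions[prev][curr] = transitions[prev].get(curr, 0) + n
    return transitions


def sample_sequence(transitions, start, max_len=10, rng=None):
    """Sample one synthetic sequence as a count-weighted random walk.

    The walk stops at an article with no recorded outgoing clicks
    or once max_len articles have been visited.
    """
    rng = rng or random.Random()
    seq = [start]
    while len(seq) < max_len and seq[-1] in transitions:
        targets = transitions[seq[-1]]
        nxt = rng.choices(list(targets), weights=list(targets.values()), k=1)[0]
        seq.append(nxt)
    return seq


if __name__ == "__main__":
    transitions = build_transitions(TOY_CLICKSTREAM)
    print(sample_sequence(transitions, "A", rng=random.Random(0)))
```

A walk of this kind uses only the previous article to choose the next one, so it cannot reproduce longer-range dependencies in real sessions; differences of that sort are what a comparison between real and synthetic sequences would need to quantify.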