Wikipedia Reader Navigation: When Synthetic Data Is Enough


January 5, 2022


Akhil Arora, Martin Gerlach, Tiziano Piccardi, Alberto García-Durán, Robert West

From arXiv, WSDM 2022, 11 pages, 16 figures

Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example of how clickstream-like data can generally enable research on user navigation on online platforms while protecting users' privacy.
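To make the idea of "synthetic sequences generated from the clickstream data" concrete, here is a minimal sketch. It assumes the clickstream is available as (source, target, count) rows and that synthetic sequences are sampled as first-order random walks weighted by those counts. The function names, the toy rows, and the stopping rule (stop at a page with no outgoing counts, or after `max_len` pages) are illustrative assumptions, not the authors' published procedure.

```python
import random
from collections import defaultdict


def build_transitions(clickstream_rows):
    """Group (source, target, count) rows into per-source targets and weights."""
    transitions = defaultdict(lambda: ([], []))
    for source, target, count in clickstream_rows:
        targets, weights = transitions[source]
        targets.append(target)
        weights.append(count)
    return transitions


def sample_sequence(transitions, start, max_len=10, rng=None):
    """Sample one synthetic navigation sequence as a count-weighted random walk.

    Assumption: the walk stops when the current page has no outgoing
    counts or when the sequence reaches max_len pages.
    """
    rng = rng or random.Random()
    sequence = [start]
    current = start
    while len(sequence) < max_len and current in transitions:
        targets, weights = transitions[current]
        # Pick the next page with probability proportional to its click count.
        current = rng.choices(targets, weights=weights, k=1)[0]
        sequence.append(current)
    return sequence


# Hypothetical toy clickstream: page A links to B and C, page B links to C.
toy_rows = [("A", "B", 3), ("A", "C", 1), ("B", "C", 2)]
toy_transitions = build_transitions(toy_rows)
toy_sequence = sample_sequence(toy_transitions, "A", max_len=5, rng=random.Random(0))
print(toy_sequence)
```

Because each step depends only on the current page, a sketch like this cannot reproduce longer-range dependencies in real sessions; the abstract's comparison of real and synthetic sequences is what quantifies how much such a simplification matters in practice.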

