Web Archive Analytics

July 2, 2021


Michael Völske, Janek Bevendorff, Johannes Kiesel, Benno Stein, Maik Fröbe, Matthias Hagen, Martin Potthast

From arXiv. 12 pages, 5 figures. Published in the proceedings of INFORMATIK 2020.

Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes, to the extent organizationally possible for researchers. To better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of the paper describes our infrastructure for processing this data treasure: we will eventually host around 8 PB of web archive data from the Internet Archive and Common Crawl, with the goal of supplementing existing large-scale web corpora and forming a non-biased subset of the 30 PB web archive at the Internet Archive.

