Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes, to the extent organizationally possible for researchers. To better convey the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data sets, such as the public internet and its web pages, or what is preserved thereof by the Internet Archive. Recently, the Webis research group, a network of university chairs to which the authors belong, concluded an agreement with the Internet Archive to download a substantial part of its web archive for research purposes. The second part of this paper describes our infrastructure for processing this data treasure: we will eventually host around 8 PB of web archive data from the Internet Archive and Common Crawl, with the goal of supplementing existing large-scale web corpora and forming an unbiased subset of the 30 PB web archive at the Internet Archive.