ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages accompanied by rich information. Its design was driven by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercial web search. Besides raw HTML, ClueWeb22 includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed, cleaned document text to lower the barrier to entry. Many of these signals have been widely used in industry but are available to the research community for the first time at this scale.