A Large Visual, Qualitative and Quantitative Dataset of Web Pages

The World Wide Web is not only one of the most important communication and information platforms today, but also an area of growing interest for scientific research. This motivates many studies and projects that require large amounts of data. However, there is no dataset that integrates the parameters and visual appearance of Web pages, because collecting such data is costly in time and effort. With the support of various software tools and scripts, we have created a large dataset of 49,438 Web pages. It consists of visual, textual and numerical data, covers all countries worldwide, and spans a broad range of topics such as art, entertainment, economy, business, education, government, news, media, science, and environment, reflecting different cultural characteristics and varied design preferences. In this paper, we describe the process of collecting, debugging and publishing the final dataset, which is freely available. To demonstrate the usefulness of our dataset, we present a binary classification model for detecting error Web pages and a multi-class, subject-based Web page categorization model, both built with convolutional neural networks.


