The World Wide Web is not only one of the most important communication and information platforms today, but also an area of growing interest for scientific research, which motivates many projects that require large amounts of data. However, no existing dataset integrates both the parameters and the visual appearance of Web pages, because collecting such data is costly in time and effort. With the support of various computer tools and programming scripts, we have created a large dataset of 49,438 Web pages. It comprises visual, textual, and numerical data types, includes all countries worldwide, and covers a broad range of topics such as art, entertainment, economy, business, education, government, news, media, science, and the environment, reflecting diverse cultural characteristics and design preferences. In this paper, we describe the process of collecting, cleaning, and publishing the final product, which is freely available. To demonstrate the usefulness of our dataset, we present a binary classification model for detecting error Web pages and a multi-class model for subject-based Web page categorization, both built with convolutional neural networks.
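To make the classification setting concrete, the following is a minimal illustrative sketch, not the model used in this work: assuming the dataset's screenshots are resized to 224x224 RGB images, a small convolutional network can serve as a baseline for the binary error-page task; the multi-class topic task would differ only in the number of output classes.

```python
# Hypothetical baseline sketch (not the authors' architecture): a small CNN that
# classifies Web page screenshots as "error page" vs. "normal page".
import torch
import torch.nn as nn

class ScreenshotCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Two convolution/pooling stages followed by a small classifier head.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 224x224 -> 112x112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 112x112 -> 56x56
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 64), nn.ReLU(),
            nn.Linear(64, num_classes),           # 2 outputs for the binary task
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Usage on a dummy batch of four RGB screenshots resized to 224x224.
model = ScreenshotCNN(num_classes=2)
logits = model(torch.randn(4, 3, 224, 224))       # shape: (4, 2)
```

For the subject-based categorization described above, the same sketch would be instantiated with `num_classes` set to the number of topic categories.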