In recent years, interest in Big Data sources has been steadily growing within the Official Statistics community. The Italian National Institute of Statistics (Istat) is currently carrying out several Big Data pilot studies. One of these studies, the ICT Big Data pilot, aims at exploiting massive amounts of textual data automatically scraped from the websites of Italian enterprises in order to predict a set of target variables (e.g. e-commerce) that are routinely observed by the traditional ICT Survey. In this paper, we show that Deep Learning techniques can successfully address this problem. Essentially, we tackle a text classification task: an algorithm must learn to infer whether an Italian enterprise performs e-commerce from the textual content of its website. To reach this goal, we developed a sophisticated processing pipeline and evaluated its performance through extensive experiments. Our pipeline uses Convolutional Neural Networks and relies on Word Embeddings to encode raw texts into grayscale images (i.e. normalized numeric matrices). Web-scraped texts are huge and have a very low signal-to-noise ratio: to overcome these issues, we adopted a framework known as False Positive Reduction, which has seldom (if ever) been applied to text classification tasks before. Several original contributions enable our processing pipeline to reach good classification results. Empirical evidence shows that our proposal outperforms all the alternative Machine Learning solutions previously tested at Istat for the same task.
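To make the encoding step concrete, the sketch below illustrates the general idea the abstract describes: per-token word embeddings are stacked into a fixed-size numeric matrix, min-max normalized into a "grayscale image", and fed to a small CNN that emits a binary e-commerce decision. This is a minimal illustration under assumed settings, not the paper's actual architecture; all names (`EMB_DIM`, `MAX_LEN`, `text_to_image`, `TextCNN`) are hypothetical.

```python
# Hypothetical sketch of an embeddings-to-grayscale-image CNN classifier.
# Assumed dimensions and layer sizes; the paper's real pipeline differs.
import numpy as np
import torch
import torch.nn as nn

EMB_DIM = 100   # assumed word-embedding dimensionality
MAX_LEN = 200   # assumed fixed number of tokens per document

def text_to_image(tokens, embeddings):
    """Stack per-token embedding vectors into a MAX_LEN x EMB_DIM matrix
    and min-max normalize it to [0, 1], i.e. a grayscale image."""
    rows = [embeddings.get(t, np.zeros(EMB_DIM)) for t in tokens[:MAX_LEN]]
    rows += [np.zeros(EMB_DIM)] * (MAX_LEN - len(rows))   # pad short texts
    m = np.stack(rows).astype(np.float32)
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo + 1e-8)                    # normalize to [0, 1]

class TextCNN(nn.Module):
    """Tiny CNN over the 1 x MAX_LEN x EMB_DIM 'image'; outputs a single
    logit for the binary e-commerce / no-e-commerce decision."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, EMB_DIM)),   # n-gram-like filters
            nn.ReLU(),
            nn.AdaptiveMaxPool2d((1, 1)),                 # global max pooling
        )
        self.fc = nn.Linear(16, 1)

    def forward(self, x):                                 # x: (B, 1, MAX_LEN, EMB_DIM)
        h = self.conv(x).flatten(1)
        return self.fc(h)                                 # raw logit

# Usage: real embeddings would come from pretrained word vectors
# (random vectors here, purely for illustration).
embeddings = {"shop": np.random.rand(EMB_DIM), "cart": np.random.rand(EMB_DIM)}
img = text_to_image("add to cart and shop online".split(), embeddings)
logit = TextCNN()(torch.from_numpy(img)[None, None])      # add batch/channel dims
print(torch.sigmoid(logit))                               # P(e-commerce), untrained
```

In a False Positive Reduction setting, a high-recall first stage would flag candidate positives and a second-stage classifier like the one above would be trained to prune the spurious ones; the sketch shows only the classification step.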