In recent years, interest in Big Data sources has been steadily growing within the Official Statistics community. The Italian National Institute of Statistics (Istat) is currently carrying out several Big Data pilot studies. One of these studies, the ICT Big Data pilot, aims at exploiting massive amounts of textual data automatically scraped from the websites of Italian enterprises in order to predict a set of target variables (e.g. e-commerce) that are routinely observed by the traditional ICT Survey. In this paper, we show that Deep Learning techniques can successfully address this problem. Essentially, we tackle a text classification task: an algorithm must learn to infer whether an Italian enterprise performs e-commerce from the textual content of its website. To reach this goal, we developed a sophisticated processing pipeline and evaluated its performance through extensive experiments. Our pipeline uses Convolutional Neural Networks and relies on Word Embeddings to encode raw texts into grayscale images (i.e. normalized numeric matrices). Web-scraped texts are huge and have a very low signal-to-noise ratio: to overcome these issues, we adopted a framework known as False Positive Reduction, which has seldom (if ever) been applied to text classification tasks before. Several original contributions enable our processing pipeline to reach good classification results. Empirical evidence shows that our proposal outperforms all the alternative Machine Learning solutions previously tested at Istat for the same task.
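To make the encoding step concrete, the sketch below illustrates the general idea the abstract describes: per-token word embeddings are stacked into a fixed-size numeric matrix, min-max normalized into a "grayscale image", and fed to a small CNN that emits a binary e-commerce decision. This is a minimal illustration under assumed settings, not the paper's actual architecture; all names (`EMB_DIM`, `MAX_LEN`, `text_to_image`, `TextCNN`) are hypothetical.

```python
# Hypothetical sketch of an embeddings-to-grayscale-image CNN classifier.
# Assumed dimensions and layer sizes; the paper's real pipeline differs.
import numpy as np
import torch
import torch.nn as nn

EMB_DIM = 100   # assumed word-embedding dimensionality
MAX_LEN = 200   # assumed fixed number of tokens per document

def text_to_image(tokens, embeddings):
    """Stack per-token embedding vectors into a MAX_LEN x EMB_DIM matrix
    and min-max normalize it to [0, 1], i.e. a grayscale image."""
    rows = [embeddings.get(t, np.zeros(EMB_DIM)) for t in tokens[:MAX_LEN]]
    rows += [np.zeros(EMB_DIM)] * (MAX_LEN - len(rows))   # pad short texts
    m = np.stack(rows).astype(np.float32)
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo + 1e-8)                    # normalize to [0, 1]

class TextCNN(nn.Module):
    """Tiny CNN over the 1 x MAX_LEN x EMB_DIM 'image'; outputs a single
    logit for the binary e-commerce / no-e-commerce decision."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, EMB_DIM)),   # n-gram-like filters
            nn.ReLU(),
            nn.AdaptiveMaxPool2d((1, 1)),                 # global max pooling
        )
        self.fc = nn.Linear(16, 1)

    def forward(self, x):                                 # x: (B, 1, MAX_LEN, EMB_DIM)
        h = self.conv(x).flatten(1)
        return self.fc(h)                                 # raw logit

# Usage: real embeddings would come from pretrained word vectors
# (random vectors here, purely for illustration).
embeddings = {"shop": np.random.rand(EMB_DIM), "cart": np.random.rand(EMB_DIM)}
img = text_to_image("add to cart and shop online".split(), embeddings)
logit = TextCNN()(torch.from_numpy(img)[None, None])      # add batch/channel dims
print(torch.sigmoid(logit))                               # P(e-commerce), untrained
```

In a False Positive Reduction setting, a high-recall first stage would flag candidate positives and a second-stage classifier like the one above would be trained to prune the spurious ones; the sketch shows only the classification step.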