In this paper we present a technique for coupling non-traditional data with statistics based on survey data, in order to partially correct for the bias produced by non-random sample selection. The major social media platforms provide huge samples of the general population, but these samples are generated by a self-selection process. As a result, they are not representative of the wider public, and conclusions drawn from them cannot be straightforwardly extrapolated to the whole population. We present an algorithm that integrates these massive data with data from traditional sources, which are less extensive but more reliable. This integration makes it possible to exploit the best of both worlds, combining the level of detail of typical "big data" sources with the representativeness of a carefully designed sample survey.