National statistical institutes are beginning to use non-traditional data sources to produce official statistics. These sources, originally collected for non-statistical purposes, include point-of-sale (POS) data and mobile phone Global Positioning System (GPS) data. Such data could significantly enhance the usefulness of official statistics. In the era of big data, many private companies accumulate vast amounts of transaction data, and exploring how to leverage these data for official statistics is increasingly important. However, progress has been slower than expected, mainly because such data are not collected through sample-based survey methods and therefore exhibit substantial selection bias. If this bias can be properly addressed, these data could become a valuable resource for official statistics, substantially expanding their scope and improving the quality of decision-making, including economic policy. This paper demonstrates that even biased transaction data can be used to produce official statistics for prompt release, by drawing on two concepts developed in the field of machine learning: density ratio estimation and supervised learning under covariate shift. As a case study, we show that preliminary statistics can be produced in a timely manner using biased data from a Japanese private employment agency. This approach enables the early release of a key labor market indicator that would otherwise be delayed by up to a year and would therefore be unavailable for timely decision-making.
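To make the idea of correcting selection bias through density ratio estimation concrete, the following is a minimal sketch on synthetic data. It is not the procedure used in the paper: the classifier-based ratio estimator, the function names (`fit_logistic`, `density_ratio`, `simulate`), and the simulated distributions are all assumptions chosen for illustration. A logistic classifier is trained to tell the biased sample apart from a reference sample of the target population, its odds are converted into the ratio w(x) = p_target(x) / p_biased(x), and the outcomes in the biased sample are then reweighted by w(x).

```python
import numpy as np


def fit_logistic(X, z, n_iter=25, ridge=1e-8):
    """Fit logistic regression by Newton's method; returns [intercept, coefs...]."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        grad = Xb.T @ (z - p)
        # Hessian of the negative log-likelihood, with a tiny ridge for stability.
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb + ridge * np.eye(Xb.shape[1])
        beta += np.linalg.solve(hess, grad)
    return beta


def density_ratio(X_biased, X_target):
    """Estimate w(x) = p_target(x) / p_biased(x) at the biased sample points."""
    X = np.vstack([X_biased, X_target])
    # Label 0 = biased sample, label 1 = target (reference) sample.
    z = np.concatenate([np.zeros(len(X_biased)), np.ones(len(X_target))])
    beta = fit_logistic(X, z)
    logit = np.column_stack([np.ones(len(X_biased)), X_biased]) @ beta
    # Bayes' rule: the ratio equals the classifier odds times n_biased / n_target.
    return (len(X_biased) / len(X_target)) * np.exp(logit)


def simulate(seed=0, n=20000):
    """Hypothetical setup: covariate x is N(0, 1) in the target population but
    N(1, 1) in the biased sample; the outcome is y = 2 + 3x + noise, so the
    true target mean of y is 2."""
    rng = np.random.default_rng(seed)
    x_target = rng.normal(0.0, 1.0, size=(n, 1))
    x_biased = rng.normal(1.0, 1.0, size=(n, 1))
    y_biased = 2.0 + 3.0 * x_biased[:, 0] + rng.normal(0.0, 1.0, size=n)

    w = density_ratio(x_biased, x_target)
    naive = y_biased.mean()                      # ignores selection bias
    weighted = np.sum(w * y_biased) / np.sum(w)  # self-normalized reweighting
    return naive, weighted


if __name__ == "__main__":
    naive, weighted = simulate()
    print(f"naive mean: {naive:.2f}, reweighted mean: {weighted:.2f}, truth: 2.00")
```

In this sketch the reweighted mean should land much closer to the true target mean than the naive mean does. That holds only under the covariate shift assumption, that is, when the relationship between x and y is the same in the biased sample and in the target population, and when the reference sample of covariates is itself representative.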