While big data analytics has brought many new opportunities to economic research, standard econometric inference based on extremum estimators requires, for datasets containing tens of millions of observations, computing power and memory that are often not accessible. In this paper, we focus on linear quantile regression applied to "ultra-large" datasets such as U.S. decennial censuses. We develop a fast inference framework based on stochastic sub-gradient descent (S-subGD) updates. The cross-sectional data are fed sequentially into the inference procedure: (i) the parameter estimate is updated as each "new observation" arrives, (ii) the updates are aggregated into the Polyak-Ruppert average, and (iii) a pivotal statistic for inference is computed from the solution path alone. We leverage insights from time series regression and construct an asymptotically pivotal statistic via random scaling. Our proposed test statistic is computed in a fully online fashion and the critical values are obtained without any resampling. We conduct extensive numerical studies to showcase the computational merits of the proposed inference procedure. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method can generate new insights beyond the computational reach of existing inference methods. Specifically, we uncover trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for over $10^3$ covariates to mitigate confounding effects.
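To make the three steps concrete, the Python sketch below implements one pass of a stochastic sub-gradient update for the quantile check loss, an online Polyak-Ruppert average, and a diagonal random-scaling variance estimate. This is a minimal illustration under stated assumptions: the function name, the step-size schedule $\gamma_t = \gamma_0 t^{-a}$, and the tuning constants are illustrative choices and not the paper's exact implementation.

```python
import numpy as np

def s_subgd_quantile(X, y, tau=0.5, gamma0=1.0, a=0.501):
    """One-pass S-subGD for linear quantile regression with Polyak-Ruppert
    averaging and a random-scaling variance estimate (illustrative sketch)."""
    n, d = X.shape
    beta = np.zeros(d)        # current S-subGD iterate
    beta_bar = np.zeros(d)    # running Polyak-Ruppert average
    A = np.zeros(d)           # running sum of s^2 * beta_bar_s
    B = np.zeros(d)           # running sum of s^2 * beta_bar_s**2 (diagonal only)
    for t in range(1, n + 1):
        x_t, y_t = X[t - 1], y[t - 1]
        # (i) sub-gradient step on the check loss rho_tau(y - x'beta)
        subgrad = -x_t * (tau - float(y_t - x_t @ beta <= 0.0))
        beta = beta - gamma0 * t ** (-a) * subgrad
        # (ii) update the Polyak-Ruppert average online
        beta_bar = beta_bar + (beta - beta_bar) / t
        # (iii) accumulate path functionals for the random-scaling variance
        A += t ** 2 * beta_bar
        B += t ** 2 * beta_bar ** 2
    # diagonal of V_hat = n^{-2} * sum_s s^2 (beta_bar_s - beta_bar_n)^2
    sum_s2 = n * (n + 1) * (2 * n + 1) / 6.0
    V_diag = (B - 2.0 * beta_bar * A + beta_bar ** 2 * sum_s2) / n ** 2
    # studentized statistic for H0: beta_j = 0; critical values come from the
    # nonstandard random-scaling distribution, not the standard normal
    t_stat = np.sqrt(n) * beta_bar / np.sqrt(V_diag)
    return beta_bar, t_stat
```

The statistic is studentized by a variance estimate built solely from the solution path, so no resampling or long-run variance plug-in is needed; critical values for this studentized statistic are tabulated in the random-scaling literature rather than taken from the normal table.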