Applications of big data analytics have brought many new opportunities to economic research, but with datasets containing millions of observations, conventional econometric inference based on extremum estimators requires computing power and memory that are often not accessible. In this paper, we focus on linear quantile regression applied to "ultra-large" datasets such as the U.S. decennial censuses. We develop a new inference framework that runs very fast, based on stochastic sub-gradient descent (S-subGD) updates. The cross-sectional data are fed sequentially into the inference procedure: (i) the parameter estimate is updated as each "new observation" arrives, (ii) the estimates are aggregated as the Polyak-Ruppert average, and (iii) a pivotal statistic for inference is computed using only the solution path. We leverage insights from time series regression and construct an asymptotically pivotal statistic via random scaling. The proposed test statistic is computed in a fully online fashion, and the critical values are obtained without any resampling methods. We conduct extensive numerical studies to showcase the computational merits of the proposed inference method. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method can generate new insights beyond the computational capabilities of existing inference methods. Specifically, we uncover the trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for over $10^3$ covariates to mitigate confounding effects.
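Steps (i) to (iii) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `ssubgd_quantile` and `t_stat`, the step-size schedule $\gamma_0 i^{-a}$, the zero starting value, and the choice to track only the diagonal of the random-scaling matrix are all assumptions made for illustration. The random-scaling quantity is taken here to be $n^{-2} \sum_{s=1}^{n} s^2 (\bar\beta_s - \bar\beta_n)^2$, computed coordinate-wise from running sums, so that each observation is visited exactly once and memory use is $O(d)$.

```python
import numpy as np


def ssubgd_quantile(X, y, tau=0.5, gamma0=1.0, a=0.501, beta0=None):
    """One pass of S-subGD for linear quantile regression (illustrative sketch).

    Returns the Polyak-Ruppert average and the diagonal of a random-scaling
    matrix. The step-size schedule gamma0 * i**(-a) is an assumption.
    """
    n, d = X.shape
    beta = np.zeros(d) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    beta_bar = np.zeros(d)   # running Polyak-Ruppert average
    A = np.zeros(d)          # running sum of s^2 * beta_bar_s^2 (elementwise)
    b = np.zeros(d)          # running sum of s^2 * beta_bar_s
    c = 0.0                  # running sum of s^2

    for i in range(1, n + 1):
        x_i, y_i = X[i - 1], y[i - 1]
        # Sub-gradient of the check loss rho_tau(y - x'beta) with respect to beta.
        g = x_i * (float(y_i <= x_i @ beta) - tau)
        # (i) update the estimate with the new observation.
        beta = beta - gamma0 * i ** (-a) * g
        # (ii) update the running average in place.
        beta_bar += (beta - beta_bar) / i
        # (iii) accumulate the sums needed for random scaling.
        w = float(i) ** 2
        A += w * beta_bar ** 2
        b += w * beta_bar
        c += w

    # Equals n^{-2} * sum_s s^2 * (beta_bar_s - beta_bar_n)^2, coordinate-wise.
    V_diag = (A - 2.0 * b * beta_bar + c * beta_bar ** 2) / n ** 2
    return beta_bar, V_diag


def t_stat(beta_bar, V_diag, n, j, null_value=0.0):
    """Randomly scaled t-type statistic for the j-th coefficient."""
    return np.sqrt(n) * (beta_bar[j] - null_value) / np.sqrt(V_diag[j])
```

The statistic returned by `t_stat` is not asymptotically standard normal, so it should be compared with critical values of the random-scaling limit distribution rather than with 1.96; those values are not reproduced in this sketch.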