Extra-large datasets are becoming increasingly accessible, and computing tools designed to handle huge amounts of data efficiently are rapidly being democratized. However, conventional statistical and econometric tools still struggle with such large datasets. This paper addresses econometrics on big datasets, focusing specifically on logistic regression on Spark. We review the robustness of the functions available in Spark for fitting logistic regression and introduce a package we developed in PySpark that returns the statistical summary of the logistic regression needed for statistical inference.
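For context on the Spark functionality discussed above, the following is a minimal sketch of fitting a logistic regression with the built-in pyspark.ml API; the dataset, column names, and application name are illustrative assumptions, not material from the paper.

```python
# Illustrative sketch: fitting a logistic regression with pyspark.ml.
# The toy data and column names below are made-up examples.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("logit-example").getOrCreate()

# Toy dataset: two regressors and a binary outcome.
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.5, 1.0), (3.0, 0.5, 0.0), (4.0, 0.2, 1.0)],
    ["x1", "x2", "label"],
)

# pyspark.ml expects the regressors packed into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label").fit(
    assembler.transform(df)
)

# Point estimates are available, but the built-in training summary does not
# report standard errors or p-values, which is the gap the paper's package
# aims to fill for statistical inference.
print(model.coefficients, model.intercept)
```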