Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL

Data-intensive analytics is an essential part of modern Earth science. Research in atmospheric physics and meteorology frequently requires processing very large observational and/or modeled datasets. Typically, these datasets (a) have high dimensionality, i.e., contain various measurements per spatiotemporal point, and (b) are extremely large, containing observations over a long time period. Additionally, (c) the analytical tasks performed on these datasets are structurally complex. Over the years, the binary format NetCDF has become a de facto standard for distributing and exchanging such multi-dimensional datasets in the Earth science community, along with tools and APIs to visualize, process, and generate them. Unfortunately, these access methods typically lack either (1) an easy-to-use but rich query interface or (2) an automatic optimization pipeline tailored to the characteristics of these datasets. As a result, Earth science researchers, who are typically not computer scientists, struggle unnecessarily to work efficiently with these datasets on a daily basis. In this work, we aim to resolve these issues. Instead of proposing yet another specialized tool and interface for working with atmospheric datasets, we integrate sophisticated NetCDF processing capabilities into the established SparkSQL dataflow engine, resulting in our system Northlight. In contrast to comparable systems, Northlight introduces a set of fully automatic optimizations specifically tailored to NetCDF processing. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms the comparable state-of-the-art pipeline by up to a factor of 6.
