Query languages in general, and SQL in particular, are arguably among the most successful programming interfaces. Yet in the domain of high-energy physics (HEP), they have found limited acceptance. This is surprising, since data analysis in HEP matches the SQL model well: the data is fully structured and is queried using combinations of selections, projections, joins, and reductions. To gain insight into why this is the case, in this paper we perform an exhaustive performance and functionality analysis of several data processing platforms (Amazon Athena, Google BigQuery, Presto, Rumble) and compare them to the new RDataFrame interface of the ROOT framework, the system most commonly used by particle physicists today. The goal of the analysis is to identify the potential advantages and shortcomings of each system, considering not only performance but also cost for cloud deployments, suitability of the query dialect, and resulting query complexity. The analysis is done using a HEP workload: the Analysis Description Languages (ADL) benchmark, created by physicists to capture representative aspects of their data processing tasks. The evaluation yields an interesting and rather complex picture of existing solutions: the systems offering the best expressiveness, conciseness, and usability turn out to be the slowest and most expensive; the fastest ones are not the most cost-efficient and require complex queries; and RDataFrame, the baseline we use as a reference, is often faster and cheaper but currently faces scalability issues on large multi-core machines. In the paper, we analyze the aspects that lead to these results and discuss how systems should evolve to better support HEP workloads. In the process, we identify several weaknesses of existing systems that should be relevant to a wide range of use cases beyond particle physics.