In recent years, dataframe libraries such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse and include custom functions that can span libraries or be written in pure Python. Most systems available to accelerate EDA workloads focus on bulk-parallel workloads, which have vastly different computational patterns and typically stay within a single library. As a result, their expensive optimization techniques can introduce excessive overheads for ad-hoc EDA workloads. Instead, we identify program rewriting as a lightweight technique that can offer substantial speedups while also avoiding slowdowns. We implement our techniques in Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including dynamic checking of the preconditions under which rewrites are correct and just-in-time rewrites for notebook environments. We show that Dias can rewrite individual cells to be 57$\times$ faster compared to pandas and 1909$\times$ faster compared to optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6$\times$ compared to pandas and 26.4$\times$ compared to modin.
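To make the idea of a rewrite guarded by a dynamic precondition check concrete, the following is a minimal sketch. It is not Dias's actual implementation or API: the function names `slow_row_sum` and `rewritten_row_sum`, the column names `a` and `b`, and the particular preconditions are all assumptions chosen for illustration. The sketch replaces a row-wise `apply` with an equivalent vectorized expression only when runtime checks pass, and otherwise runs the original code unchanged.

```python
import pandas as pd


def slow_row_sum(df):
    # Original cell body: calls a Python lambda once per row.
    return df.apply(lambda row: row["a"] + row["b"], axis=1)


def rewritten_row_sum(df):
    # Hypothetical rewritten cell body (illustrative, not Dias's code).
    # The preconditions are checked at run time, just before execution:
    #   - df really is a pandas DataFrame,
    #   - column labels are unique, so df["a"] is a single column,
    #   - both columns exist,
    #   - the frame is non-empty (row-wise apply may return a
    #     differently shaped result on an empty frame).
    preconditions_hold = (
        isinstance(df, pd.DataFrame)
        and df.columns.is_unique
        and {"a", "b"}.issubset(df.columns)
        and len(df) > 0
    )
    if preconditions_hold:
        # Fast path: one vectorized addition over whole columns.
        return df["a"] + df["b"]
    # Fallback: run the original code, preserving its behavior.
    return slow_row_sum(df)


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
fast = rewritten_row_sum(df)
slow = slow_row_sum(df)
```

Because the fallback is the original code, a failed check costs only the check itself, which is in keeping with the goal of avoiding slowdowns.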