Dias: Dynamic Rewriting of Pandas Code - 专知 (Zhuanzhi) Papers

March 28, 2023

Dias: Dynamic Rewriting of Pandas Code

Stefanos Baziotis,Daniel Kang,Charith Mendis

From arXiv, 16 pages, 22 figures

In recent years, dataframe libraries such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse, including custom functions which can span libraries or be written in pure Python. The majority of systems available to accelerate EDA workloads focus on bulk-parallel workloads, which contain vastly different computational patterns, typically within a single library. As a result, they can introduce excessive overheads for ad-hoc EDA workloads due to their expensive optimization techniques. Instead, we identify program rewriting as a lightweight technique which can offer substantial speedups while also avoiding slowdowns. We implemented our techniques in Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including dynamic checking of preconditions under which rewrites are correct and just-in-time rewrites for notebook environments. We show that Dias can rewrite individual cells to be 57× faster compared to pandas and 1909× faster compared to optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6× compared to pandas and 26.4× compared to modin.
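The idea of rewriting a cell just before it runs, guarded by a precondition checked at run time, can be illustrated with a small sketch. This is not Dias's implementation and it does not use pandas: to stay dependency-free, it uses a standard-library analogue, rewriting `sorted(E)[:N]` into a guarded call that takes the `heapq.nsmallest` fast path only when the rewrite is known to be equivalent. The names `guarded_head`, `SortedHeadRewriter`, and `run_cell` are hypothetical and chosen for illustration only.

```python
import ast
import builtins
import heapq


def guarded_head(fn, xs, n):
    """Dynamic precondition check (hypothetical helper).

    The fast path is only equivalent to fn(xs)[:n] when `fn` really is the
    builtin `sorted` and `n` is a non-negative int, so both are checked here,
    at run time, with the values the cell actually produced.
    """
    if fn is builtins.sorted and isinstance(n, int) and n >= 0:
        return heapq.nsmallest(n, xs)  # fast path: avoids a full sort
    return fn(xs)[:n]                  # fallback: original semantics


class SortedHeadRewriter(ast.NodeTransformer):
    """Rewrite the pattern `sorted(E)[:N]` into `guarded_head(sorted, E, N)`."""

    def visit_Subscript(self, node):
        self.generic_visit(node)
        call, sl = node.value, node.slice
        matches = (
            isinstance(node.ctx, ast.Load)
            and isinstance(sl, ast.Slice)
            and sl.lower is None and sl.step is None and sl.upper is not None
            and isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name) and call.func.id == "sorted"
            and len(call.args) == 1 and not call.keywords
            and not isinstance(call.args[0], ast.Starred)
        )
        if not matches:
            return node  # leave everything else untouched
        new_call = ast.Call(
            func=ast.Name(id="guarded_head", ctx=ast.Load()),
            args=[call.func, call.args[0], sl.upper],
            keywords=[],
        )
        return ast.copy_location(new_call, node)


def run_cell(source, namespace):
    """Rewrite one cell's source, execute it, and return the rewritten code."""
    tree = SortedHeadRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace.setdefault("guarded_head", guarded_head)
    exec(compile(tree, "<cell>", "exec"), namespace)
    return ast.unparse(tree)  # requires Python 3.9+


if __name__ == "__main__":
    ns = {"xs": [5, 3, 9, 1], "n": 2}
    print(run_cell("out = sorted(xs)[:n]", ns))
    print(ns["out"])
```

The syntactic match happens once per cell, while the check inside `guarded_head` runs with the actual values. A negative `n`, or a user-defined function that shadows `sorted`, therefore falls back to the original behavior instead of producing a wrong result.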
