Background: The energy consumption of machine learning and its impact on the environment have made energy-efficient ML an emerging area of research. However, most attention remains focused on model creation, training, and inference. Data-oriented stages such as preprocessing, cleaning, and exploratory analysis form a critical part of the machine learning workflow, yet their energy efficiency has received little attention from researchers. Aim: Our study aims to explore the energy consumption of different dataframe processing libraries as a first step towards studying the energy efficiency of the data-oriented stages of the machine learning pipeline. Method: We measure the energy consumption of 3 popular libraries used to work with dataframes, namely Pandas, Vaex, and Dask, for 21 different operations grouped under 4 categories on 2 datasets. Results: The results of our analysis show that, for a given dataframe processing operation, the choice of library can indeed influence energy consumption, with some libraries consuming 202 times less energy than others. Conclusion: The results of our study indicate that there is potential for optimizing the energy consumption of the data-oriented stages of the machine learning pipeline and that further research is needed in this direction.
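To make the Method concrete, the following is a minimal sketch of how the energy of a single dataframe operation might be measured. The abstract does not state which measurement tool was used, so everything here is an assumption for illustration: the sketch reads the Linux powercap (Intel RAPL) counter files `energy_uj` and `max_energy_range_uj`, and the helper names `read_energy_uj` and `measure_energy_joules` are hypothetical. The sysfs path can differ between machines and may require elevated privileges to read.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Tuple

# Assumption: Linux powercap interface for Intel RAPL (package 0).
# The path may differ per machine and may not be readable without privileges.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_energy_uj(rapl_dir: Path = RAPL_DIR) -> int:
    """Read the cumulative energy counter, in microjoules."""
    return int((rapl_dir / "energy_uj").read_text())


def read_max_energy_uj(rapl_dir: Path = RAPL_DIR) -> int:
    """Read the range of the counter, used to correct for wrap-around."""
    return int((rapl_dir / "max_energy_range_uj").read_text())


def measure_energy_joules(
    operation: Callable[[], object],
    read_counter: Callable[[], int] = read_energy_uj,
    counter_max: Optional[int] = None,
) -> Tuple[float, float]:
    """Run `operation` once; return (energy in joules, elapsed seconds).

    Hypothetical helper: it samples the counter before and after the call,
    so it captures everything the measured domain consumed in that window,
    not only the operation itself.
    """
    start_uj = read_counter()
    t0 = time.perf_counter()
    operation()
    elapsed = time.perf_counter() - t0
    end_uj = read_counter()

    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped during the measurement; a single wrap is assumed.
        if counter_max is None:
            raise ValueError("counter wrapped; pass counter_max to correct")
        delta_uj += counter_max
    return delta_uj / 1e6, elapsed
```

Under these assumptions, the same operation could be wrapped in a callable for each library (for example, a group-by on a Pandas, Vaex, or Dask dataframe) and passed to `measure_energy_joules`, repeating each measurement several times to average out background activity. Note that Dask evaluates lazily, so the callable would need to trigger the computation itself for the comparison to be fair.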