Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from log analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on optimizing these frameworks, including their storage management. The shift to cloud computing requires optimizing across all pipelines running concurrently on a cluster. In this paper, we look at one specific instance of this problem: placement of I/O-intensive temporary intermediate data on SSD and HDD. Efficient data placement is challenging since I/O density is usually unknown at the time data needs to be placed. Additionally, external factors such as load variability, job preemption, or job priorities can affect job completion times, which in turn affect the I/O density of the temporary files in the workload. In this paper, we envision that machine learning can be used to solve this problem. We analyze production logs from Google's data centers for a range of data processing pipelines. Our analysis shows that I/O density may be predictable. This suggests that learning-based strategies, if crafted carefully, could extract features that predict the I/O density of temporary files involved in various transformations, which could be used to improve the efficiency of storage management in data processing pipelines.
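To make the envisioned approach concrete, the following is a minimal, hypothetical sketch of what such a learning-based placement policy could look like: a regressor predicts a temporary file's I/O density from features plausibly available at creation time, and the prediction drives an SSD-vs-HDD decision. The feature names, the synthetic training data, and the placement threshold are illustrative assumptions only, not the method or data described in this paper.

```python
# Hypothetical sketch: predict I/O density of a temporary file at creation
# time and choose a storage tier. All features, data, and thresholds are
# assumptions for illustration, not the paper's actual model or workload.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic training set: each row describes a temp file by
# (transform_id, log2(input bytes), pipeline_priority); the label is the
# observed I/O density (e.g., bytes read+written per byte stored).
n = 2000
transform_id = rng.integers(0, 20, size=n)
input_bytes_log = rng.uniform(20, 40, size=n)
pipeline_priority = rng.integers(0, 3, size=n)
X = np.column_stack([transform_id, input_bytes_log, pipeline_priority])

# Assumed ground truth: density depends mostly on the producing transform,
# with noise standing in for load variability and preemption effects.
io_density = np.exp(0.2 * transform_id) * (1 + 0.1 * pipeline_priority)
io_density *= rng.lognormal(mean=0.0, sigma=0.3, size=n)

model = GradientBoostingRegressor().fit(X, io_density)

def choose_tier(features, ssd_threshold=10.0):
    """Place a new temp file on SSD only if its predicted I/O density
    exceeds an (assumed) threshold; otherwise default to HDD."""
    predicted = model.predict(np.asarray(features).reshape(1, -1))[0]
    return "SSD" if predicted > ssd_threshold else "HDD"

print(choose_tier([17, 33.0, 2]))  # high-density transform: likely "SSD"
print(choose_tier([2, 25.0, 0]))   # low-density transform: likely "HDD"
```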