Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, implementing efficient input pipelines is challenging, as it requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over 2 million ML jobs in Google datacenters reveals that a significant fraction of model training jobs could benefit from faster input data pipelines. At the same time, our analysis shows that most jobs do not saturate host hardware, pointing toward software-based bottlenecks. Motivated by these findings, we propose Plumber, a tool for finding bottlenecks in ML input pipelines. Plumber uses an extensible and interpretable analytical model based on operational analysis to automatically tune parallelism, prefetching, and caching under host resource constraints. Across five representative ML pipelines, Plumber obtains speedups of up to 46x for misconfigured pipelines. By automating caching, Plumber obtains end-to-end speedups of over 40% compared to state-of-the-art tuners.
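To make the three tuning knobs concrete, the sketch below is an illustrative toy input pipeline, not Plumber's actual implementation or API: parallelism corresponds to the number of worker threads running the transform, prefetching to a bounded buffer decoupling producer and consumer, and caching to memoization of transformed items. All names here are hypothetical.

```python
# A minimal sketch (assumed, not from the paper) of an input pipeline
# exposing the knobs the abstract describes: parallelism, prefetching,
# and caching. Uses only the Python standard library.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import Thread

def run_pipeline(items, transform, parallelism=4, prefetch=8, cache=True):
    memo = {}                       # cache: memoize transformed items
    buf = Queue(maxsize=prefetch)   # prefetch: bounded buffer between stages
    SENTINEL = object()             # marks end of stream

    def transform_cached(x):
        if cache and x in memo:
            return memo[x]
        y = transform(x)
        if cache:
            memo[x] = y
        return y

    def produce():
        # parallelism: transform items on a pool of worker threads;
        # pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            for out in pool.map(transform_cached, items):
                buf.put(out)
        buf.put(SENTINEL)

    Thread(target=produce, daemon=True).start()
    while (item := buf.get()) is not SENTINEL:
        yield item
```

In a real framework such as tf.data these knobs appear as `num_parallel_calls`, `prefetch`, and `cache`; a tuner like Plumber searches over such settings subject to host CPU, memory, and disk constraints.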