Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and interconnects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines. In this paper, we provide an in-depth analysis of data preprocessing pipelines from four different machine learning domains. We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines and extract individual trade-offs to optimize throughput, preprocessing time, and storage consumption. Additionally, we provide an open-source profiling library that can automatically decide on a suitable preprocessing strategy to maximize throughput. By applying our generated insights to real-world use cases, we obtain an increased throughput of 3x to 13x compared to an untuned system while keeping the pipeline functionally identical. These findings show the enormous potential of data pipeline tuning.