In this paper, we primarily focus on understanding the data preprocessing pipeline for DNN training in the public cloud. First, we run experiments to measure the performance implications of the two major data preprocessing approaches, which consume either raw data files or packed record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a highly optimized data preprocessing library. Second, we identify the potential causes, evaluate a variety of optimization methods, and present their pros and cons. We hope this work will shed light on the co-design of the ``data storage and loading pipeline'' and the ``training framework,'' along with flexible resource configuration between them, so that resources can be fully exploited and performance can be maximized.