A growing number of machine learning frameworks have recently made deep learning accessible to a wider audience of engineers, scientists, and practitioners by allowing straightforward use of complex neural network architectures and algorithms. However, since deep learning is rapidly evolving, not only through theoretical advances but also with respect to hardware and software engineering, ML frameworks often lose backward compatibility and introduce technical debt that can lead to bottlenecks and sub-optimal resource utilization. Moreover, the focus is, in most cases, not on deep learning engineering but on new models and theoretical advances. In this work, we focus on engineering, more specifically on the data loading pipeline in the PyTorch framework. We designed a series of benchmarks that outline performance issues of certain steps in the data loading process. Our findings show that for classification tasks that involve loading many files, such as images, training wall-time can be significantly improved. With our new, modified ConcurrentDataloader we improve GPU utilization and significantly reduce batch loading time, by up to 12X. This allows the use of cloud-based, S3-like object storage for datasets while keeping training time comparable to datasets stored on local drives.
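To illustrate the general idea behind intra-batch concurrent loading, the following is a minimal sketch (not the paper's actual ConcurrentDataloader implementation): a dataset that yields whole batches and overlaps the per-item I/O requests with a thread pool, so that high-latency fetches from S3-like object storage do not run serially. The names FetchConcurrentDataset and fetch_uri are hypothetical placeholders introduced here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import torch
from torch.utils.data import Dataset, DataLoader


def fetch_uri(uri):
    # Stand-in for an I/O-bound request (e.g. an HTTP GET against S3-like storage);
    # here it just returns a dummy image tensor so the sketch runs as-is.
    return torch.zeros(3, 224, 224)


class FetchConcurrentDataset(Dataset):
    """Hypothetical dataset that returns whole batches; items within a batch are
    fetched concurrently by a thread pool instead of one request at a time."""

    def __init__(self, uris, batch_size, num_fetch_threads=16):
        self.batches = [uris[i:i + batch_size] for i in range(0, len(uris), batch_size)]
        self.num_fetch_threads = num_fetch_threads
        self._executor = None  # created lazily so the dataset stays picklable for DataLoader workers

    def _pool(self):
        if self._executor is None:
            self._executor = ThreadPoolExecutor(max_workers=self.num_fetch_threads)
        return self._executor

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        # map() preserves order; each fetch_uri call is an independent network request
        samples = list(self._pool().map(fetch_uri, self.batches[idx]))
        return torch.stack(samples)


if __name__ == "__main__":
    uris = [f"s3://bucket/img_{i}.jpg" for i in range(1024)]  # placeholder object keys
    # batch_size=None disables automatic batching because the dataset already yields
    # full batches; worker processes still overlap batch loading with the training step.
    loader = DataLoader(FetchConcurrentDataset(uris, batch_size=256),
                        batch_size=None, num_workers=2)
    for batch in loader:
        print(batch.shape)  # torch.Size([256, 3, 224, 224])
```

In this sketch, process-level parallelism (DataLoader workers) and thread-level parallelism (the per-batch pool) are combined: the former hides Python-side preprocessing cost, while the latter hides per-object network latency.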