GPUs have been favored for training deep learning models due to their highly parallelized architecture, so most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when choosing hardware for training. In particular, CPU servers can be beneficial if training on CPUs were more efficient, since they incur lower hardware-upgrade costs and make better use of existing infrastructure. This paper makes several contributions to research on training deep learning models on CPUs. First, it presents a method for optimizing the training of deep learning models on Intel CPUs and a toolkit called ProfileDNN, which we developed to improve performance profiling. Second, we describe a generic training optimization method that guides our workflow and explore several case studies in which we identified performance issues and then optimized the Intel Extension for PyTorch, resulting in an overall 2x training performance increase for the RetinaNet-ResNext50 model. Third, we show how to leverage the visualization capabilities of ProfileDNN, which enabled us to pinpoint bottlenecks and create a custom focal loss kernel that was two times faster than the official reference PyTorch implementation.
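For reference, the focal loss computed by the custom kernel mentioned above can be sketched in plain Python. This is a minimal scalar sketch of the standard binary focal loss formula, not the paper's optimized CPU kernel; the function name and parameter defaults (`alpha=0.25`, `gamma=2.0`) are illustrative assumptions following common usage:

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability `p` of the
    positive class and ground-truth label `target` (0 or 1).

    The (1 - p_t)**gamma factor down-weights well-classified
    examples; alpha balances the positive/negative classes.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes far less loss
# than an uncertain one, which is the point of the gamma factor.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.5, 1)
```

With `alpha=1.0` and `gamma=0.0` the expression reduces to the ordinary cross-entropy term `-log(p_t)`, which is a quick sanity check for any optimized kernel implementing it.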