Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for the operators that dominate the device active time, and (2) categorizing operator overheads into five types to quantify their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm that predicts the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall end-to-end (E2E) per-batch training time prediction, respectively, when using overheads from individual workloads. Sharing overheads across workloads incurs a slight 2.19% increase in E2E prediction error, which suggests that shared overheads are feasible for large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model together with graph-level data and task dependency analysis, we show that our system can provide more general model-system co-design than previous methods.
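
To make the idea of combining kernel runtimes with operator overheads more concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes a single device stream, operators visited in execution order, and one lumped host-side overhead per operator instead of the five overhead types. The names `Op`, `predict_batch_time`, and `geomean_error` are hypothetical, and the geomean error definition (geometric mean of absolute relative errors) is an assumption about how such a metric is typically computed.

```python
from dataclasses import dataclass
from math import exp, log


@dataclass
class Op:
    """One operator in execution order (hypothetical fields, microseconds)."""
    name: str
    host_overhead_us: float  # host-side time spent before the kernel launch
    kernel_us: float         # predicted kernel runtime (0 if no kernel is launched)


def predict_batch_time(ops):
    """Return (e2e_time_us, gpu_active_us) for a single stream of operators.

    Simplifying assumption: a kernel starts once the host has issued it
    and the device has finished the previous kernel, whichever is later.
    """
    cpu_clock = 0.0   # time at which the host can issue the next kernel
    gpu_clock = 0.0   # time at which the device finishes its last kernel
    gpu_active = 0.0  # sum of kernel runtimes
    for op in ops:
        cpu_clock += op.host_overhead_us
        if op.kernel_us > 0:
            start = max(cpu_clock, gpu_clock)  # device idles if the host is late
            gpu_clock = start + op.kernel_us
            gpu_active += op.kernel_us
    return max(cpu_clock, gpu_clock), gpu_active


def geomean_error(predicted, measured):
    """Geometric mean of absolute relative errors (assumed definition).

    Assumes every prediction differs from its measurement, since log(0)
    is undefined.
    """
    errs = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return exp(sum(log(e) for e in errs) / len(errs))


if __name__ == "__main__":
    ops = [Op("embedding", 20, 100), Op("mlp", 10, 50), Op("loss", 30, 5)]
    e2e, active = predict_batch_time(ops)
    print(f"E2E {e2e} us, active {active} us, idle {e2e - active} us")
```

With the illustrative numbers above, the device is active for 155 us of a 175 us batch, so the remaining 20 us is idle time caused by host overhead. This is the kind of gap that motivates modeling idle time separately from kernel time.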
