Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. To reduce this training cost and optimize cluster-wide hardware resource usage, we analyze GPU cluster usage statistics from a well-known research institute. Our study reveals that single-accelerator training jobs can dominate cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely underutilizing the hardware. This is because DL researchers and practitioners often lack the expertise required to independently optimize their own workloads. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models across jobs often contain operators of the same types and shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively and easily improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to the operator level, and then trains those models simultaneously on a shared accelerator. On three emerging DL training workloads and state-of-the-art accelerators (GPUs and TPUs), HFTA demonstrates strong effectiveness in squeezing out hardware utilization and achieves up to $15.1 \times$ higher training throughput than the standard practice of running each job on a separate accelerator.
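To make characteristic (ii) concrete, the following is a minimal sketch (plain PyTorch with hypothetical sizes, not HFTA's actual API) of the idea behind inter-model horizontal fusion: applying K structurally identical Linear operators from K independent jobs is mathematically equivalent to a single batched matrix multiplication (torch.baddbmm), a kernel that is already well optimized on modern accelerators.

```python
import torch

# K independent "jobs" (e.g., hyper-parameter tuning trials) that each train
# a Linear(in_features, out_features) operator with different weights.
K, B, in_features, out_features = 4, 32, 64, 128

weights = [torch.randn(out_features, in_features) for _ in range(K)]
biases = [torch.randn(out_features) for _ in range(K)]
inputs = [torch.randn(B, in_features) for _ in range(K)]

# Standard practice: K separate (often small, underutilizing) matmul kernels.
separate = [x @ w.t() + b for x, w, b in zip(inputs, weights, biases)]

# Horizontally fused: one batched matmul over the stacked weights and inputs.
W = torch.stack(weights)                    # (K, out_features, in_features)
bias = torch.stack(biases).unsqueeze(1)     # (K, 1, out_features)
X = torch.stack(inputs)                     # (K, B, in_features)
fused = torch.baddbmm(bias, X, W.transpose(1, 2))  # (K, B, out_features)

# The fused result matches the per-job results up to floating-point tolerance.
for k in range(K):
    assert torch.allclose(separate[k], fused[k], atol=1e-5)
```

The same reasoning extends to other common operators; for example, several convolutions of identical shape can be expressed as one grouped convolution, so the fused array of models can be trained with far fewer, larger kernel launches on a shared accelerator.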