In recent years, to meet the resource-intensive computational demands of training deep neural networks (DNNs), it has become widely accepted that exploiting parallelism in large-scale computing clusters is critical for the efficient deployment of DNN training jobs. However, existing resource schedulers for traditional computing clusters are not well suited for DNN training, resulting in unsatisfactory job completion times. The limitations of these resource scheduling schemes motivate us to propose a new cluster resource scheduling framework that leverages the special layered structure of DNN jobs to significantly improve their job completion times. Our contributions in this paper are three-fold: i) We develop a new resource scheduling analytical model that captures the layered structure of DNNs, which enables us to analytically formulate the resource scheduling optimization problem for DNN training in computing clusters; ii) Based on the proposed performance analytical model, we develop an efficient resource scheduling algorithm for the widely adopted parameter-server architecture, using a sum-of-ratios multi-dimensional-knapsack decomposition (SMD) method that offers strong performance guarantees; iii) We conduct extensive numerical experiments to demonstrate the effectiveness of the proposed scheduling algorithm and its superior performance over the state of the art.
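For readers less familiar with the parameter-server architecture that contribution ii) builds on, the following is a minimal sketch of synchronous parameter-server training on a toy linear-regression workload. The names used here (ParameterServer, local_gradient) are illustrative assumptions for this sketch, not part of the paper's system or API:

```python
# A minimal sketch of the parameter-server pattern: workers compute gradients
# on private data shards; the server averages them and updates the shared model.
# All names and the toy workload are hypothetical illustrations.
import numpy as np

class ParameterServer:
    """Holds the shared model and applies averaged worker gradients."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def apply(self, gradients):
        # Synchronous update: average the workers' gradients, take one SGD step.
        self.weights -= self.lr * np.mean(gradients, axis=0)

def local_gradient(weights, X, y):
    # Mean-squared-error gradient for a linear model on one worker's data shard.
    residual = X @ weights - y
    return 2.0 * (X.T @ residual) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
server = ParameterServer(dim=2)

# Each of four workers holds a private shard of the training data.
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 2))
    shards.append((X, X @ true_w + 0.01 * rng.normal(size=64)))

for step in range(200):
    grads = [local_gradient(server.weights, X, y) for X, y in shards]
    server.apply(grads)

print(server.weights)  # converges toward true_w = [1.5, -2.0]
```

In a real DNN training job, each iteration of this push/pull exchange happens per layer, which is exactly the layered structure the proposed scheduling model exploits.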