Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs and multiple types of GPUs, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables an efficient training process for diverse workloads on heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to schedule the workload of each layer to appropriate computing resources so as to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The code of the framework is publicly available at: https://github.com/PaddlePaddle/Paddle.
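To make the scheduling objective concrete, the following is a minimal, hypothetical sketch of the optimization problem the abstract describes: assign each DNN layer to a device type so that monetary cost is minimized while a throughput constraint is met. The device names, prices, and throughput numbers are invented for illustration, and the brute-force search stands in for the paper's actual RL-based scheduler, which it does not reproduce.

```python
# Toy model of layer-to-device scheduling: IO-bound layers (e.g., sparse
# embeddings) favor cheap CPUs, compute-bound layers favor GPUs.
# All numbers below are made up for illustration only.
from itertools import product

DEVICES = {
    # device: (cost per hour, {workload kind: relative throughput})
    "cpu": (0.10, {"io": 3.0, "compute": 1.0}),
    "gpu": (0.90, {"io": 1.0, "compute": 8.0}),
}

# Each layer is (name, dominant workload kind).
LAYERS = [("embedding", "io"), ("fc1", "compute"), ("fc2", "compute")]

def plan_cost(plan):
    """Total monetary cost of assigning plan[i] to layer i."""
    return sum(DEVICES[d][0] for d in plan)

def plan_throughput(plan):
    """Pipeline throughput is bounded by the slowest stage."""
    return min(DEVICES[d][1][kind] for d, (_, kind) in zip(plan, LAYERS))

def schedule(min_throughput):
    """Cheapest assignment meeting the throughput constraint (brute force)."""
    best = None
    for plan in product(DEVICES, repeat=len(LAYERS)):
        if plan_throughput(plan) >= min_throughput:
            if best is None or plan_cost(plan) < plan_cost(best):
                best = plan
    return best
```

With a loose constraint the cheapest all-CPU plan wins; tightening the throughput requirement forces compute-bound layers onto GPUs, which is exactly the cost/throughput trade-off the RL scheduler in Paddle-HeterPS learns to navigate at scale.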