Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing work has focused on reducing the computational and resource overheads of running Deep Neural Network (DNN) inference on resource-constrained mobile devices. However, the other aspect of DNN operation, i.e., training (forward and backward passes) on smartphone GPUs, has received little attention thus far. To this end, we conduct an initial analysis to examine the feasibility of on-device training on smartphones using mobile GPUs. We first employ the open-source mobile DL framework MNN and its OpenCL backend for running compute kernels on GPUs. Next, we observe that training on CPUs is much faster than on GPUs and identify two possible bottlenecks behind this observation: (i) a computation bottleneck and (ii) a memory bottleneck. To address the computation bottleneck, we optimize the OpenCL backend's kernels, showing a roughly 2x throughput improvement (40-70 GFLOPS) over CPUs (15-30 GFLOPS) on Snapdragon 8 series processors. However, we find that full DNN training is still much slower on GPUs than on CPUs, indicating that the memory bottleneck plays a significant role in the GPU's lower performance. Data movement accounts for almost 91% of the training time due to low memory bandwidth. Lastly, based on the findings and failures encountered during our investigation, we present limitations and practical guidelines for future research directions.
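As a concrete illustration of the kind of measurement behind these numbers, the following minimal sketch (not the paper's code; the device setup, buffer size, and toy kernel are placeholder assumptions) uses OpenCL profiling events to time a host-to-device transfer separately from kernel execution. This is one standard way to attribute training time to compute versus data movement on a mobile GPU; error handling is omitted for brevity.

```c
/* Hypothetical sketch: separate transfer time from kernel time via
 * OpenCL profiling events. Not from the paper; all sizes and the
 * kernel are illustrative stand-ins. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Elapsed time of a completed event in milliseconds. */
static double event_ms(cl_event ev) {
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    return (end - start) * 1e-6; /* nanoseconds -> milliseconds */
}

int main(void) {
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* CL_QUEUE_PROFILING_ENABLE is required for event timestamps. */
    cl_command_queue q = clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, NULL);

    const size_t n = 1 << 22; /* ~4M floats, a stand-in for one layer's tensors */
    float *host = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);

    /* Time the host->device copy (the data-movement component). */
    cl_event copy_ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), host, 0, NULL, &copy_ev);
    printf("host->device copy: %.3f ms\n", event_ms(copy_ev));

    /* A trivial kernel stands in for an optimized training kernel. */
    const char *src =
        "__kernel void scale(__global float *x) {"
        "  size_t i = get_global_id(0); x[i] *= 2.0f; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    /* Time the kernel execution (the compute component). */
    cl_event run_ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, &run_ev);
    clWaitForEvents(1, &run_ev);
    printf("kernel execution:  %.3f ms\n", event_ms(run_ev));

    clReleaseEvent(copy_ev); clReleaseEvent(run_ev);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseMemObject(buf); clReleaseCommandQueue(q); clReleaseContext(ctx);
    free(host);
    return 0;
}
```

For a memory-bound workload such as full DNN training, the transfer time reported by such events dominates the kernel time, which is consistent with the roughly 91% data-movement share reported above.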